tSNE vs. UMAP: Global Structure Preservation

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

Abstract

In this notebook, we will discuss the importance of global structure preservation by the tSNE and UMAP neighbor graph algorithms. we will check to what extent both algorithms can preserve global structure of syntetic and real-world scRNAseq data, as well as discuss where this originates mathematically from.

Table of Contents:

Global Structure Preservation: Why It Is Important

If you use tSNE and UMAP only for visulaization of high-dimensional data, you probably have never thought about how much of global structure they can preserve. Indeed, both tSNE and UMAP were designed to predominanlty preserve local structure that is to group neigboring data points together which provides very informative visualization.

In [177]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/GyroidSculpture.jpg', width=1000)
Out[177]:

However, if you want to make a next step beyond visualization and run clustering you will have difficulty doing it on the original data due to the Curse of Dimensionality and the choice of a proper distance metric, instead clustering on tSNE or UMAP reduced dimensions can be more promising. Below, I use scRNAseq data from Björklund et al. Innate Limphoid Cells (ILC) and compare K-means clustering on 1) raw expression data, 2) significant 30 Principal Components (PCs), 3) tSNE 2D representation, and 4) 30 UMAP components. As we can see, due to the Curse of Dimensionality and non-linearity of the scRNAseq data, K-means clustering failed on the raw expression data and the 30 PCs were apparently also too high-dimensional space for K-means to succeed. In contrast, non-linear dimension reduction down to 2 for tSNE and 30 for UMAP resulted in almost perfect agreement between clustering and tSNE dimension reduction.

In [3]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/Kmeans_tSNE_SignPCs_RawExpr_UMAPComp.png', width=2000)
Out[3]:

However, this is where you really want to make sure that the tSNE and UMAP components capture enough of global structure in the original data, that is preserve distance between clusters of data points, in order to obtain a correct hierarchy between the data points or clusters of data points that were distant in high dimensions. Although not perfectly correct, in layman's terms one can say that:

  • Local structure preservation is important for visualization
  • Global structure preservation is important for clustering

It is widely acknowledged that clustering on 2D tSNE representation is not a good idea because the distances between clusters (global structure) are not guaranteed to be preserved, therefore proximity of two clusters on a tSNE plot does not imply a biological similarity between the two cell populations. Clustering on a 2D UMAP representation may be better idea since UMAP preserves more of a global structure, however this is still to be proven. What we definitely see from the plot above is that clustering on a number of UMAP components outperformes clustering on a number of PCs since 30 UMAP components retain more variation in the data than 30 PCs. In contrast, tSNE can not deliver more than 3 components due to technical limitations, so clustering on a number of tSNE components is basically impossible. So clustering - this is where you start seeing the difference between tSNE and UMAP.

What Exactly We Mean By Global Structure Preservation

Remember that the goal of dimension reduction is to transform the data from high- to low-dimensional space, i.e. represent the cells / samples in low-dimensional space without loosing too much information, i.e. preserving distances between both close and distant samples / cells. What exactly do we mean when we say that a dimension reduction algorithm is capable of preserving global structure? Both tSNE and UMAP define the probability to observe points at a certain distance to belong to the following exponential family:

$$p_{ij}\approx \displaystyle e^{\displaystyle -\frac{(x_i-x_j)^2}{2\sigma_i^2}}$$

Here $\sigma_i$ is a parameter responsible for how much cells / samples can "feel" each other. Since $\sigma_i$ is a finite value, i.e. does not go yo infinity, every data point can "feel" the presence only its closest neighbors and not the distant points, therefore both tSNE and UMAP are neighbor graph algorithms and hence preserve local structure of the data. However, in the limit $\sigma_i \rightarrow \infty$ there is a chance that every point "remembers" every other point, so in this limit in theory both tSNE and UMAP can preserve global structure. However, it is not the $\sigma_i$ that the hyperparameter of tSNE and UMAP but the perplexity and number of nearest neigbors n_neigbors, respectively. Let us check what perplexity and n_neighbors values lead to the limit $\sigma_i \rightarrow \infty$. For this purpose we will take one syntetic and one real-world scRNAseq data sets and compute how mean $\sigma$ depend on the perplexity and n_neighbors.

To check how much PCA / MDS, tSNE and UMAP are capable of preserving global structure, let us contruct a syntetic data set representing the world map with 5 continents: Eurasia, Africa, North America, South America and Australia. We sample a few thousand points from the areas of the continents and will use them for testing the dimension reduction techniques.

In [3]:
import cartopy
import matplotlib
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader

shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
                                        category='cultural', name=shapename)

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('NorthAmerica.png')
plt.close()
        
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela', 
                                           'Columbia', 'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('SouthAmerica.png')
plt.close()

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Australia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia', 
                                          'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden', 
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway', 
                                           'Italy', 'Iran', 'Pakistan', 'Afganistan', 'Iraq', 'Bulgaria', 
                                           'Romania', 'Turkmenistan', 'Uzbekistan' 'Austria', 'Ireland', 
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Eurasia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Marocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia', 
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic', 
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia', 
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa', 
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 
                                           'Nigeria', 'Cameroon']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Africa.png')
plt.close()


rng = np.random.RandomState(123)
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})


N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)

N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)

N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)

N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)

N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)

plt.title('Original World Map Data Set', fontsize = 25)
plt.xlabel('Dimension 1', fontsize = 22); plt.ylabel('Dimension 2', fontsize = 22)

plt.show()

As a result we have a collection of 2D data points belonging to 5 clusters / continents. So far this is planar linear geometry, therefore linear dimensionality reduction techniques should be able to reconstruct the original data.

In [4]:
X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)
(3023, 2)
(3023,)

We will start with linear dimensiona reduction techniques: PCA and MDS that can perfectly preserve global distances.

In [5]:
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 25)
plt.xlabel("PCA1", fontsize = 22)
plt.ylabel("PCA2", fontsize = 22)
plt.show()
In [127]:
from sklearn.manifold import MDS
model_mds = MDS(n_components = 2, random_state = 123, metric = True)
mmds = model_mds.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(mmds[:, 0], mmds[:, 1], c = y, s = 50)
plt.title('Metric Multi-Dimensional Scaling (MDS)', fontsize = 20)
plt.xlabel("MDS1", fontsize = 20); plt.ylabel("MDS2", fontsize = 20)
plt.show()

We confirm that up to linear transformations such as flip, shift and rotation, the original data set is very well reconstructed by the PCA and MDS linear dimension reduction techniques. Let us now check how non-linear dimensiona reduction techniques such as tSNE and UMAP perform on the 2D linear data. We deliberately select large perplexity and n_neighbors hyperparameters that should result in $\sigma_i \rightarrow \infty$ and therefore better preservation of the global structure.

In [6]:
from sklearn.manifold import TSNE
X_reduced = PCA(n_components = 2).fit_transform(X)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 500, 
             init = X_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 25); plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 1501 nearest neighbors...
[t-SNE] Indexed 3023 samples in 0.001s...
[t-SNE] Computed neighbors for 3023 samples in 0.674s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.060978
[t-SNE] Computed conditional probabilities in 2.641s
[t-SNE] Iteration 50: error = 43.8401680, gradient norm = 0.0025554 (50 iterations in 2.953s)
[t-SNE] Iteration 100: error = 43.6828728, gradient norm = 0.0001747 (50 iterations in 2.801s)
[t-SNE] Iteration 150: error = 43.6869354, gradient norm = 0.0002149 (50 iterations in 2.696s)
[t-SNE] Iteration 200: error = 43.6884346, gradient norm = 0.0004383 (50 iterations in 2.573s)
[t-SNE] Iteration 250: error = 43.6885376, gradient norm = 0.0001415 (50 iterations in 2.606s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 43.688538
[t-SNE] Iteration 300: error = 0.1987910, gradient norm = 0.0011702 (50 iterations in 2.674s)
[t-SNE] Iteration 350: error = 0.1548541, gradient norm = 0.0001670 (50 iterations in 2.829s)
[t-SNE] Iteration 400: error = 0.1482517, gradient norm = 0.0000505 (50 iterations in 2.858s)
[t-SNE] Iteration 450: error = 0.1469540, gradient norm = 0.0000291 (50 iterations in 2.983s)
[t-SNE] Iteration 500: error = 0.1465182, gradient norm = 0.0000220 (50 iterations in 2.913s)
[t-SNE] Iteration 550: error = 0.1463438, gradient norm = 0.0000148 (50 iterations in 2.902s)
[t-SNE] Iteration 600: error = 0.1462867, gradient norm = 0.0000125 (50 iterations in 2.904s)
[t-SNE] Iteration 650: error = 0.1462306, gradient norm = 0.0000110 (50 iterations in 4.097s)
[t-SNE] Iteration 700: error = 0.1461740, gradient norm = 0.0000088 (50 iterations in 3.267s)
[t-SNE] Iteration 750: error = 0.1461396, gradient norm = 0.0000078 (50 iterations in 2.887s)
[t-SNE] Iteration 800: error = 0.1461195, gradient norm = 0.0000079 (50 iterations in 2.925s)
[t-SNE] Iteration 850: error = 0.1460714, gradient norm = 0.0000074 (50 iterations in 2.897s)
[t-SNE] Iteration 900: error = 0.1460554, gradient norm = 0.0000064 (50 iterations in 5.802s)
[t-SNE] Iteration 950: error = 0.1460518, gradient norm = 0.0000067 (50 iterations in 3.812s)
[t-SNE] Iteration 1000: error = 0.1461060, gradient norm = 0.0000065 (50 iterations in 3.063s)
[t-SNE] KL divergence after 1000 iterations: 0.146106
In [7]:
import warnings
warnings.filterwarnings("ignore")

from umap import UMAP
X_reduced = PCA(n_components = 2).fit_transform(X)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 500, 
             init = X_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 25); plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(a=None, angular_rp_forest=False, b=None,
     init=array([[ 0.25580736, -0.08346226],
       [ 0.21164187, -0.0044089 ],
       [ 0.25814581,  0.02748583],
       ...,
       [-0.04541859,  0.06084731],
       [ 0.03837112,  0.01024752],
       [-0.01219052,  0.02886706]]),
     learning_rate=1, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=1, n_components=2, n_epochs=1000,
     n_neighbors=500, negative_sample_rate=5, random_state=None,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=2)
Construct fuzzy simplicial set
Tue Mar  3 17:49:18 2020 Finding Nearest Neighbors
Tue Mar  3 17:49:18 2020 Finished Nearest Neighbor Search
Tue Mar  3 17:49:21 2020 Construct embedding
	completed  0  /  1000 epochs
	completed  100  /  1000 epochs
	completed  200  /  1000 epochs
	completed  300  /  1000 epochs
	completed  400  /  1000 epochs
	completed  500  /  1000 epochs
	completed  600  /  1000 epochs
	completed  700  /  1000 epochs
	completed  800  /  1000 epochs
	completed  900  /  1000 epochs
Tue Mar  3 17:49:52 2020 Finished embedding

The quality of visualizations are comparable between tSNE and UMAP in a sense that all the 5 clusters / continents are well distinguishable. However, we can see that the original shapes of the continents are a bit better preserved by UMAP. In addition the South America seems to be placed between Africa and North America by tSNE while it is correctly placed on the same longitude as North America by UMAP.

Previously, it was a 2D data point collection on the linear planar surface. Let us now embedd the 2D data points into the 3D non-linear manifold such as swiss roll. Swill roll represents a kind of Archimedean spiral in a 3D space.

In [70]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/SwissRoll.png', width=2000)
Out[70]:

When we project our 2D world map onto 3D non-linear manifold, the intrinsic dimensionality of the data is still two even though they points are embedded into 3D space. Therefore linear dimension reduction techniques usually fail reconstructing data on a non-linear manifold as they try to preserve distances between all pairs of points including the ones that are not neighboring on a manifold. Let's project the world map onto the swiss roll.

In [9]:
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)

X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
Out[9]:
(3023, 3)
In [53]:
from mpl_toolkits import mplot3d
plt.figure(figsize=(20,15))
ax = plt.axes(projection = '3d')
ax.scatter3D(X_swiss_roll[:, 0], X_swiss_roll[:, 1], X_swiss_roll[:, 2], c = y)
plt.show()

First we will check how linear dimension reduction such as PCA and MDS perform on the swiss roll.

In [10]:
from sklearn.decomposition import PCA
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(X_swiss_roll_reduced[:, 0], X_swiss_roll_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 25)
plt.xlabel("PCA1", fontsize = 22); plt.ylabel("PCA2", fontsize = 22)
plt.show()
In [69]:
from sklearn.manifold import MDS
model_mds = MDS(n_components = 2, random_state = 123, metric = True)
mds = model_mds.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(mds[:, 0], mds[:, 1], c = y, s = 50)
plt.title('Metric Multi-Dimensional Scaling (MDS)', fontsize = 20)
plt.xlabel("MDS1", fontsize = 20); plt.ylabel("MDS2", fontsize = 20)
plt.show()

As expected both PCA nd MDS fail reconstructing the original data since they try to preserve global distances while for swiss roll it is more important to preserve local neighborhood. Let us now see whether tSNE and UMAP can do it better. Note that for both tSNE and UMAP we start with PCA as initialization for the gradient descent algorithm.

In [11]:
from sklearn.manifold import TSNE
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 50, 
             init = X_swiss_roll_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 25); plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 3023 samples in 0.003s...
[t-SNE] Computed neighbors for 3023 samples in 0.140s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.033405
[t-SNE] Computed conditional probabilities in 0.297s
[t-SNE] Iteration 50: error = 56.7442093, gradient norm = 0.0189936 (50 iterations in 0.606s)
[t-SNE] Iteration 100: error = 54.1754074, gradient norm = 0.0130908 (50 iterations in 0.621s)
[t-SNE] Iteration 150: error = 53.1689682, gradient norm = 0.0095972 (50 iterations in 0.603s)
[t-SNE] Iteration 200: error = 52.6251984, gradient norm = 0.0104257 (50 iterations in 0.585s)
[t-SNE] Iteration 250: error = 52.2827072, gradient norm = 0.0080654 (50 iterations in 0.583s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 52.282707
[t-SNE] Iteration 300: error = 0.7452142, gradient norm = 0.0010242 (50 iterations in 0.595s)
[t-SNE] Iteration 350: error = 0.4960896, gradient norm = 0.0003599 (50 iterations in 0.599s)
[t-SNE] Iteration 400: error = 0.4164004, gradient norm = 0.0002017 (50 iterations in 0.593s)
[t-SNE] Iteration 450: error = 0.3832679, gradient norm = 0.0001595 (50 iterations in 0.584s)
[t-SNE] Iteration 500: error = 0.3684293, gradient norm = 0.0001464 (50 iterations in 0.605s)
[t-SNE] Iteration 550: error = 0.3607625, gradient norm = 0.0001210 (50 iterations in 0.655s)
[t-SNE] Iteration 600: error = 0.3553360, gradient norm = 0.0001125 (50 iterations in 0.630s)
[t-SNE] Iteration 650: error = 0.3510026, gradient norm = 0.0001069 (50 iterations in 0.603s)
[t-SNE] Iteration 700: error = 0.3474116, gradient norm = 0.0000999 (50 iterations in 0.599s)
[t-SNE] Iteration 750: error = 0.3445701, gradient norm = 0.0000983 (50 iterations in 0.595s)
[t-SNE] Iteration 800: error = 0.3423376, gradient norm = 0.0000935 (50 iterations in 0.594s)
[t-SNE] Iteration 850: error = 0.3403157, gradient norm = 0.0000833 (50 iterations in 0.695s)
[t-SNE] Iteration 900: error = 0.3384252, gradient norm = 0.0000810 (50 iterations in 0.604s)
[t-SNE] Iteration 950: error = 0.3368797, gradient norm = 0.0000751 (50 iterations in 0.614s)
[t-SNE] Iteration 1000: error = 0.3352237, gradient norm = 0.0000801 (50 iterations in 0.595s)
[t-SNE] KL divergence after 1000 iterations: 0.335224
In [12]:
import warnings
warnings.filterwarnings("ignore")

from umap import UMAP
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 50, 
             init = X_swiss_roll_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 25); plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(a=None, angular_rp_forest=False, b=None,
     init=array([[ 0.12535979,  0.40356518],
       [ 0.00599581,  0.49934623],
       [ 0.12201155,  0.40671414],
       ...,
       [-0.43454162, -0.28447539],
       [-0.55506668,  0.15485573],
       [-0.53099092, -0.11675039]]),
     learning_rate=1, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=1, n_components=2, n_epochs=1000,
     n_neighbors=50, negative_sample_rate=5, random_state=None,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=2)
Construct fuzzy simplicial set
Tue Mar  3 22:15:22 2020 Finding Nearest Neighbors
Tue Mar  3 22:15:23 2020 Finished Nearest Neighbor Search
Tue Mar  3 22:15:23 2020 Construct embedding
	completed  0  /  1000 epochs
	completed  100  /  1000 epochs
	completed  200  /  1000 epochs
	completed  300  /  1000 epochs
	completed  400  /  1000 epochs
	completed  500  /  1000 epochs
	completed  600  /  1000 epochs
	completed  700  /  1000 epochs
	completed  800  /  1000 epochs
	completed  900  /  1000 epochs
Tue Mar  3 22:15:43 2020 Finished embedding

A very obvious bug of tSNE that we immediately see is that if one follows the line from South America towards Africa, one passes Eurasia that was placed by tSNE for some reason between South America and Africa. In contrast, UMAP correctly places Africa between South America and Eurasia.

Obviously, both tSNE and UMAP reconstructed the the original world map better than the PCA and MDS. This is because linear methods such as PCA and MDS get full affinity matrix as input and try to preserve distances between all pairs of points while non-linear neigbor graph methods such as tSNE / UMAP and Locally Linear Embedding (LLE) get sparse affinity matrix (KNN-graph) as input and preserve only distances between nearest neighbors.

In [87]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/MDS_vs_LLE.png', width=2000)
Out[87]:

The quality of both tSNE and UMAP visualizations are comparable, although we used quite different hyperparameters to reach this similarity in the outcome. We used large learning rate 200 (default 200) and quite a low perplexity 50 (default 30) for tSNE, while we used small learning rate 1 (default 1) and large number of nearest neighbors n_neighbor = 500 (default 15) for UMAP. These are very important hyperparameters as they determine contributions from initialization and cost function to the final embeddings. From coding tSNE and UMAP from scratch it becomes clear that there are two major contributions to the global structure preservation when we run the gradient descent algorithm updating embeddings according to

$$y_i = y_i -\mu \frac{\partial \rm{Cost}}{\partial y_i}$$

We see that the final embedding will depend on:

  • initialization (we start wtih random, PCA or Laplacian Eigenmaps coordinates)
  • internal algorithmic peculiarities that boil down to the cost function, that is Kullback-Leibler (KL) divergence for tSNE and Cross-Entropy (CE) for UMAP
In [71]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/GradDesc.png', width=2000)
Out[71]:

We are going to compare the two algorithms for their global structure preservation on the world map syntetic data set. People with biological background tend to ignore this approach and prefer to immediately jump to the real-world noisy biological data. However, these simple data sets allow to learn something fundamental about the algorithms, so here we will use more of the good old physics approach.

When we were running PCA / MDS and tSNE / UMAP on 2D linear and 3D non-linear data, we came to opposite conclusions:

  • PCA / MDS perfectly reconstructed the 2D linear data set while performance of tSNE / UMAP was tolarable but worse than PCA /MDS
  • PCA / MDS totally failed reconstructing 3D non-linear data while performance of tSNE / UMAP was much better

Running tSNE and UMAP we started with PCA as an initialization. Therefore any difference in the final output can be only explained by the term with the cost function $\displaystyle \mu \frac{\partial \rm{Cost}}{\partial y_i}$. Let us compute how this term behaves for the world map syntetic data set. For this purpose we need to understand the dependence between perplexity and $\sigma$ parameter in the denominator of the power of the exponential probability to observe points at a certain distance

$$p_{ij}\approx \displaystyle e^{\displaystyle -\frac{(x_i-x_j)^2}{2\sigma_i^2}}$$

Here we are going to compute the function $\sigma(Perplexity)$ for the world map syntetic data set.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

import warnings
warnings.filterwarnings("ignore")

#X_train = X;
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)

dist = np.square(euclidean_distances(X_train, X_train))

plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
        exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
        exp_distance[dist_row] = 0
        prob_not_symmetr = exp_distance / np.sum(exp_distance)
        return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(3, X_train.shape[0], 10):
    
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERPLEXITY)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
    
    my_perp.append(PERPLEXITY)
    my_sigma_tSNE.append(np.mean(sigma_array))
    
plt.figure(figsize=(20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
sns.distplot(sigma_array)
plt.title("tSNE: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
                1110020A21Rik  1110046J04Rik  1190002F15Rik  1500015A07Rik
SS2_15_0048_A3            0.0            0.0            0.0            0.0
SS2_15_0048_A6            0.0            0.0            0.0            0.0
SS2_15_0048_A5            0.0            0.0            0.0            0.0
SS2_15_0048_A4            0.0            0.0            0.0            0.0

This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)
Perplexity = 3, Mean Sigma = 3.92004231500892
Perplexity = 13, Mean Sigma = 5.520980451360095
Perplexity = 23, Mean Sigma = 6.091983624676752
Perplexity = 33, Mean Sigma = 6.481132027823166
Perplexity = 43, Mean Sigma = 6.794533916025855
Perplexity = 53, Mean Sigma = 7.0666387760439395
Perplexity = 63, Mean Sigma = 7.312361754518648
Perplexity = 73, Mean Sigma = 7.539448125402354
Perplexity = 83, Mean Sigma = 7.751667299750131
Perplexity = 93, Mean Sigma = 7.95146206903724
Perplexity = 103, Mean Sigma = 8.140649209475384
Perplexity = 113, Mean Sigma = 8.320984227697277
Perplexity = 123, Mean Sigma = 8.493982879809161
Perplexity = 133, Mean Sigma = 8.661166249706758
Perplexity = 143, Mean Sigma = 8.823815670759318
Perplexity = 153, Mean Sigma = 8.982999364757005
Perplexity = 163, Mean Sigma = 9.139716292226781
Perplexity = 173, Mean Sigma = 9.29464441437961
Perplexity = 183, Mean Sigma = 9.448621526110772
Perplexity = 193, Mean Sigma = 9.602192393894303
Perplexity = 203, Mean Sigma = 9.755941742625316
Perplexity = 213, Mean Sigma = 9.910426326304174
Perplexity = 223, Mean Sigma = 10.066091015352217
Perplexity = 233, Mean Sigma = 10.22344461366451
Perplexity = 243, Mean Sigma = 10.382794801083357
Perplexity = 253, Mean Sigma = 10.54484617776711
Perplexity = 263, Mean Sigma = 10.709790544136942
Perplexity = 273, Mean Sigma = 10.878288546088022
Perplexity = 283, Mean Sigma = 11.050645199567912
Perplexity = 293, Mean Sigma = 11.227454553103314
Perplexity = 303, Mean Sigma = 11.409138834010289
Perplexity = 313, Mean Sigma = 11.596250800447091
Perplexity = 323, Mean Sigma = 11.789405812098327
Perplexity = 333, Mean Sigma = 11.989055399122185
Perplexity = 343, Mean Sigma = 12.195846887940135
Perplexity = 353, Mean Sigma = 12.41039031044731
Perplexity = 363, Mean Sigma = 12.633379611223104
Perplexity = 373, Mean Sigma = 12.865303614952046
Perplexity = 383, Mean Sigma = 13.106974809529396
Perplexity = 393, Mean Sigma = 13.358840729271233
Perplexity = 403, Mean Sigma = 13.621652592493835
Perplexity = 413, Mean Sigma = 13.895696767881596
Perplexity = 423, Mean Sigma = 14.181664536119174
Perplexity = 433, Mean Sigma = 14.479820954733055
Perplexity = 443, Mean Sigma = 14.790566939881394
Perplexity = 453, Mean Sigma = 15.114091628090629
Perplexity = 463, Mean Sigma = 15.45076530072942
Perplexity = 473, Mean Sigma = 15.800857011166364
Perplexity = 483, Mean Sigma = 16.164673107296395
Perplexity = 493, Mean Sigma = 16.542802309856736
Perplexity = 503, Mean Sigma = 16.936095733216355
Perplexity = 513, Mean Sigma = 17.34547641690217
Perplexity = 523, Mean Sigma = 17.77229095970452
Perplexity = 533, Mean Sigma = 18.218338822519314
Perplexity = 543, Mean Sigma = 18.685885647821692
Perplexity = 553, Mean Sigma = 19.1776659235608
Perplexity = 563, Mean Sigma = 19.697141380949393
Perplexity = 573, Mean Sigma = 20.248696790727156
Perplexity = 583, Mean Sigma = 20.837773157897608
Perplexity = 593, Mean Sigma = 21.471276629570475
Perplexity = 603, Mean Sigma = 22.15808330301466
Perplexity = 613, Mean Sigma = 22.909743825816577
Perplexity = 623, Mean Sigma = 23.74180069182838
Perplexity = 633, Mean Sigma = 24.675720896800804
Perplexity = 643, Mean Sigma = 25.741958085385114
Perplexity = 653, Mean Sigma = 26.98616741755821
Perplexity = 663, Mean Sigma = 28.480371283419306
Perplexity = 673, Mean Sigma = 30.348038540206144
Perplexity = 683, Mean Sigma = 32.82577365470332
Perplexity = 693, Mean Sigma = 36.454818768208256
Perplexity = 703, Mean Sigma = 42.95848334967757
Perplexity = 713, Mean Sigma = 68.43397204436404

Please note that the probability distribution of distances between data points in high dimensions corresponding to large perplexity equal to 713 is centered around a very small values 0.0013, implying that despite the perplexity is large, the high-dimensional probability still does not approach one, and not even close to one.

We can immediately see that at small perplexity the "memory" parameter $\sigma_i$ grows approximetely linearly with perplexity. However, when perplexity approaches the sample size N, the "memory" parameter $\sigma_i$ hyperbolically goes to infinity. We can approximate the behavior of $\sigma_i$ as a function of perplexity with the following simple assymptotic:

$$\sigma (\rm{Perp}) \approx \frac{\rm{Perp} / N}{1 - \rm{Perp} / N}$$

Let us check how well this assymptotic describes the obtained function $\sigma (\rm{Perp})$:

In [3]:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

N = X_train.shape[0]

perp = np.array(my_perp)
sigma_exact = np.array(my_sigma_tSNE)

sigma = lambda perp, a, b, c: (c*(perp / N)**a) / (1 - (perp / N)**b)
    
p , _ = optimize.curve_fit(sigma, perp, sigma_exact)
print(p)

plt.figure(figsize=(20,15))
plt.plot(perp, sigma_exact, "o")
plt.plot(perp, sigma(perp, p[0], p[1], p[2]), c = "red")
plt.title("Non-Linear Least Square Fit", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("X", fontsize = 20); plt.ylabel("Y", fontsize = 20)
plt.show()
[  0.87962035 111.68423252  26.71340589]

The fact the parameter $\sigma$ can in principal go to infinity at large perplexity means in practice that the contribution to the gradient descent from the cost function disappears and tSNE becomes heavily dominated by its initialization at large perplexities. Let us prove it from the functional form of the tSNE gradient of the cost function.

Both tSNE and UMAP start with an initialization (random, PCA or Laplacian Eigenmaps) and update the coordinates via the gradient descent algorithm. Here I will ignore normalization constants in the equations of probabilities:

$$y_i = y_i -\mu \frac{\partial KL}{\partial y_i}; \quad \frac{\partial KL}{\partial y_i} = 4\sum_j{(p_{ij}-q_{ij})(y_i-y_j)\frac{1}{1+(y_i-y_j)^2}}; \quad q_{ij}\approx \frac{1}{1+(y_i-y_j)^2}; \quad p_{ij}\approx \displaystyle e^{\displaystyle -\frac{(x_i-x_j)^2}{2\sigma_i^2}}$$

In the limit $\sigma_i \rightarrow \infty$ the probability to observe points at a distance in high-dimensional space becomes $p_{ij} \rightarrow 1$. Therefore:

$$\frac{\partial KL}{\partial y_i} \approx 4\sum_j{\left(1-\frac{1}{1+(y_i-y_j)^2}\right)(y_i-y_j)\frac{1}{1+(y_i-y_j)^2}} = 4\sum_j{\frac{(y_i-y_j)^3}{(1+(y_i-y_j)^2)^2}}$$

In the limit of close embedding points: $$y_i-y_j \rightarrow 0: \quad \frac{\partial KL}{\partial y_i} \approx 4\sum_i{(y_i-y_j)^3} \rightarrow 0$$ In the limit of distant embedding points: $$y_i-y_j \rightarrow \infty: \quad \frac{\partial KL}{\partial y_i} \approx 4\sum_i{\frac{1}{y_i-y_j}} \rightarrow 0$$

We conclude that the contribution to the gradient descent updating rule from the cost function, $\displaystyle\mu\frac{\partial KL}{\partial y_i}$, disappears and therefore if one starts with PCA as an initialization step one ends up with the PCA at the end as initial positions of the points are not getting updated by the gradient descent.

Let us now compute the $\sigma(\rm{n\_neighbor})$ dependence for UMAP and compare it to tSNE:

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances


dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]

plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)

def k(prob):
    return np.power(2, np.sum(prob))

def sigma_binary_search(k_of_sigma, fixed_k):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 10):

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))

plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("UMAP: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
sns.distplot(sigma_array)
plt.title("UMAP: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
N_neighbor = 3, Mean Sigma = 0.00095367431640625
N_neighbor = 13, Mean Sigma = 60.18894211540009
N_neighbor = 23, Mean Sigma = 72.26585143105278
N_neighbor = 33, Mean Sigma = 78.65388300165783
N_neighbor = 43, Mean Sigma = 82.91104119583215
N_neighbor = 53, Mean Sigma = 86.05896161255224
N_neighbor = 63, Mean Sigma = 88.53755450115524
N_neighbor = 73, Mean Sigma = 90.57147942441802
N_neighbor = 83, Mean Sigma = 92.28989531873991
N_neighbor = 93, Mean Sigma = 93.77371931875219
N_neighbor = 103, Mean Sigma = 95.07648372117367
N_neighbor = 113, Mean Sigma = 96.23584667397611
N_neighbor = 123, Mean Sigma = 97.27877611554534
N_neighbor = 133, Mean Sigma = 98.22561088221033
N_neighbor = 143, Mean Sigma = 99.0917163188231
N_neighbor = 153, Mean Sigma = 99.88931437444421
N_neighbor = 163, Mean Sigma = 100.62791781718505
N_neighbor = 173, Mean Sigma = 101.3152372903664
N_neighbor = 183, Mean Sigma = 101.95771009562402
N_neighbor = 193, Mean Sigma = 102.56053349159284
N_neighbor = 203, Mean Sigma = 103.12805229059144
N_neighbor = 213, Mean Sigma = 103.66415711088554
N_neighbor = 223, Mean Sigma = 104.17187280495074
N_neighbor = 233, Mean Sigma = 104.6539418524204
N_neighbor = 243, Mean Sigma = 105.11270981261184
N_neighbor = 253, Mean Sigma = 105.55032245273696
N_neighbor = 263, Mean Sigma = 105.96843005558632
N_neighbor = 273, Mean Sigma = 106.3687801361084
N_neighbor = 283, Mean Sigma = 106.75261939704085
N_neighbor = 293, Mean Sigma = 107.12125447875295
N_neighbor = 303, Mean Sigma = 107.47582419624541
N_neighbor = 313, Mean Sigma = 107.81723160983464
N_neighbor = 323, Mean Sigma = 108.14648633562653
N_neighbor = 333, Mean Sigma = 108.46436090309527
N_neighbor = 343, Mean Sigma = 108.77148798724127
N_neighbor = 353, Mean Sigma = 109.06863345780187
N_neighbor = 363, Mean Sigma = 109.35646195651432
N_neighbor = 373, Mean Sigma = 109.63532644943152
N_neighbor = 383, Mean Sigma = 109.90591422139599
N_neighbor = 393, Mean Sigma = 110.16860354546063
N_neighbor = 403, Mean Sigma = 110.42387392267834
N_neighbor = 413, Mean Sigma = 110.67202104536514
N_neighbor = 423, Mean Sigma = 110.91354306183713
N_neighbor = 433, Mean Sigma = 111.14873033662082
N_neighbor = 443, Mean Sigma = 111.37780930076897
N_neighbor = 453, Mean Sigma = 111.6011875301766
N_neighbor = 463, Mean Sigma = 111.81909678368595
N_neighbor = 473, Mean Sigma = 112.03168623940239
N_neighbor = 483, Mean Sigma = 112.23934748985248
N_neighbor = 493, Mean Sigma = 112.44219508250998
N_neighbor = 503, Mean Sigma = 112.64057265979618
N_neighbor = 513, Mean Sigma = 112.83446157444789
N_neighbor = 523, Mean Sigma = 113.02424542730746
N_neighbor = 533, Mean Sigma = 113.20994819342756
N_neighbor = 543, Mean Sigma = 113.39183359838731
N_neighbor = 553, Mean Sigma = 113.56999487850253
N_neighbor = 563, Mean Sigma = 113.74458920356281
N_neighbor = 573, Mean Sigma = 113.91572845714718
N_neighbor = 583, Mean Sigma = 114.08359112020311
N_neighbor = 593, Mean Sigma = 114.24823313452012
N_neighbor = 603, Mean Sigma = 114.40984230467727
N_neighbor = 613, Mean Sigma = 114.56845725714827
N_neighbor = 623, Mean Sigma = 114.72423649367008
N_neighbor = 633, Mean Sigma = 114.87725060745325
N_neighbor = 643, Mean Sigma = 115.0276034903926
N_neighbor = 653, Mean Sigma = 115.17533510090918
N_neighbor = 663, Mean Sigma = 115.32060527268735
N_neighbor = 673, Mean Sigma = 115.46343531688498
N_neighbor = 683, Mean Sigma = 115.60397973939693
N_neighbor = 693, Mean Sigma = 115.74217460674947
N_neighbor = 703, Mean Sigma = 115.87817708873216
N_neighbor = 713, Mean Sigma = 116.0121177161872

Please note that the high-dimensional probability values became much larger for UMAP and some small peak around 1 is quite visible.

We can see that $\sigma$ depends on n_neighbors in a very different way for UMAP compared to tSNE. It does not go to infinity that fast compared to tSNE. This implies that as Perplexity / N_neighbors approaches the sample size N, for UMAP $\sigma$ remains finite while for tSNE $\sigma$ goes to infinity. We can approximate $\sigma(\rm{n\_neighbor})$ dependence with the following simple expression:

$$\sigma(\rm{n\_neighbor})\approx -\frac{1}{\displaystyle\ln\left(\frac{\log_2(\rm{n\_neighbor})}{N}\right)}$$
In [5]:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

N = X_train.shape[0]

n_neighbor = np.array(my_n_neighbor)
sigma_exact = np.array(my_sigma_umap)

sigma = lambda n_neighbor, a, b: - a / np.log(b * np.log2(n_neighbor) / N)
    
p , _ = optimize.curve_fit(sigma, n_neighbor, sigma_exact)
print(p)

plt.figure(figsize=(20,15))
plt.plot(n_neighbor, sigma_exact, "o")
plt.plot(n_neighbor, sigma(n_neighbor, p[0], p[1]), c = "red")
plt.title("Non-Linear Least Square Fit", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("X", fontsize = 20); plt.ylabel("Y", fontsize = 20)
plt.show()
[136.0680298   24.09538417]

Let us compare how Perplexity and N_neighbors hyperparameters behave for tSNE and UMAP, respectively, for the same data set where the Euclidean distances are fixed. For this purpose we need to realize that the computed sigma for UMAP is not directly comparable with the computed sigma for tSNE since we have $2\sigma^2$ in the denominator of the power of the exponent in the equation for high-dimensional probabilities, while it is just $\sigma$ in the corresponding denominator for UMAP. Therefore, we need to square all the obtained tSNE sigmas and multiply them by two.

In [6]:
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
In [7]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')

plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

UMAP's mean sigma does not change much with n_neighbor and reaches much smaller values than tSNE's mean sigma, so really hard to compare them as they are not on the same sacel, log-transform did not improve the view, let us restrict the x-axis in order to enlarge and resolve the behavior of UMAP's mean sigma.

In [11]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')

plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.ylim(0,600); plt.xlim(0,500)
plt.show()

Again we can see that for UMAP the dependence $\sigma(\rm{n\_neighbor})$ is logarithmic and therefore very slow while for tSNE $\sigma(\rm{Perplexity})$ very quickly goes to infinity when Perplexity approaches the sample size N. Therefore tSNE is much more sensitive to the Perplexity hyperparameter than UMAP towards N_neighbors hyperparameter.

Computing Mean Sigma vs. Perplexity and N_neighbors for World Map Data Set

In order to be on a safe side with our conclusion that mean sigma vs. n_neighbors goes to infinity much slower for UMAP than mean sigma vs. perplexity for tSNE, which was done for Cancer Associated Fibroblasts (CAFs) data set, we will recompute these curves for the synthetic World Map data set. We will do it first for the 2D (linear manifold) data set and then repeat it for the 3D Swiss Roll embedded data set (non-linear manifold).

In [2]:
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader

shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
                                        category='cultural', name=shapename)

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('NorthAmerica.png')
plt.close()
        
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela', 
                                           'Columbia', 'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('SouthAmerica.png')
plt.close()

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Australia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia', 
                                          'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden', 
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway', 
                                           'Italy', 'Iran', 'Pakistan', 'Afganistan', 'Iraq', 'Bulgaria', 
                                           'Romania', 'Turkmenistan', 'Uzbekistan' 'Austria', 'Ireland', 
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Eurasia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Marocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia', 
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic', 
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia', 
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa', 
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 
                                           'Nigeria', 'Cameroon']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Africa.png')
plt.close()


rng = np.random.RandomState(123)
plt.figure(figsize = (20,15))

N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)

N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)

N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)

N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)

N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)

X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)

plt.show()
(3023, 2)
(3023,)
In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

import warnings
warnings.filterwarnings("ignore")

X_train = X
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)

dist = np.square(euclidean_distances(X_train, X_train))

plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(3, X_train.shape[0], 200):
    
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERPLEXITY)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
    
    my_perp.append(PERPLEXITY)
    my_sigma_tSNE.append(np.mean(sigma_array))
    
plt.figure(figsize = (20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("tSNE: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 2) (3023,)
Perplexity = 3, Mean Sigma = 0.0020824360445245308
Perplexity = 203, Mean Sigma = 0.026983336190928145
Perplexity = 403, Mean Sigma = 0.04243961123492676
Perplexity = 603, Mean Sigma = 0.05657405518870302
Perplexity = 803, Mean Sigma = 0.07055265557241487
Perplexity = 1003, Mean Sigma = 0.08290341623642568
Perplexity = 1203, Mean Sigma = 0.09496204907476921
Perplexity = 1403, Mean Sigma = 0.1074566653419164
Perplexity = 1603, Mean Sigma = 0.12099297284054669
Perplexity = 1803, Mean Sigma = 0.13608960887670438
Perplexity = 2003, Mean Sigma = 0.1530494273517292
Perplexity = 2203, Mean Sigma = 0.17244280311314428
Perplexity = 2403, Mean Sigma = 0.19591208763008988
Perplexity = 2603, Mean Sigma = 0.22778556790921295
Perplexity = 2803, Mean Sigma = 0.28223238991381827
Perplexity = 3003, Mean Sigma = 0.5550955527277526

We can see that the Mean sigma vs. perplexity dependence for the World Map synthetic data set is qualitatively very similar to the one for the CAFs scRNAseq data set despite the sample size is three times larger now. We notice again that the high-dimensional probabilities are not close to one again despite we reach the large perplexity values close to the sample size of the data set.

Now let us check how UMAP's mean sigma vs. n_neighbors behaves for the synthetic World Map data set. Here I discovered that at default UMAP's hyperparameters local_connectivity = 1 and bandwidth = 1, the mean sigma is almost constant and has a very low values in the order of magnitude ~$10^{-5}$. Thus in order to compare it with tSNE's mean sigma vs. perplexity I had to increase bandwidth hyperparameter from default 1 to 200.

In [31]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

import warnings
warnings.filterwarnings("ignore")

X_train = X
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)

BANDWIDTH = 200

dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]

plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)

def k(prob):
    return np.power(2, np.sum(prob) / BANDWIDTH)

def sigma_binary_search(k_of_sigma, fixed_k):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 200):

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))
        
plt.figure(figsize = (20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("UMAP: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("UMAP: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 2) (3023,)
N_neighbor = 3, Mean Sigma = 0.00778965472070894
N_neighbor = 203, Mean Sigma = 0.08925199114325975
N_neighbor = 403, Mean Sigma = 0.11678645834006342
N_neighbor = 603, Mean Sigma = 0.13711868119563167
N_neighbor = 803, Mean Sigma = 0.15409679709386243
N_neighbor = 1003, Mean Sigma = 0.16911141019840725
N_neighbor = 1203, Mean Sigma = 0.1827940970826851
N_neighbor = 1403, Mean Sigma = 0.19555181767662116
N_neighbor = 1603, Mean Sigma = 0.2076098195693369
N_neighbor = 1803, Mean Sigma = 0.21915612455791406
N_neighbor = 2003, Mean Sigma = 0.23023931545569257
N_neighbor = 2203, Mean Sigma = 0.2409641292368857
N_neighbor = 2403, Mean Sigma = 0.25143277909319567
N_neighbor = 2603, Mean Sigma = 0.26164589597025023
N_neighbor = 2803, Mean Sigma = 0.27171326440728505
N_neighbor = 3003, Mean Sigma = 0.28158125402593787

Note that the high-dimensional probabilities have the mode around 1, therefore there is a dramatic difference compared to tSNE where the high-dimensional probabilities are never close to 1. This is probably not only the effect of large n_neighbor hyperparameter but also the bandwidth which multiplies the log2(n_neighbor) and thus effectively increases the n_neighbor hyperparameter. In a sense, bandwidth has something to do with the empirical early exxageration hyperparameter of tSNE because increasing the n_neighbor value it effectively increases the high-dimensional probability values by some factor, the same way as the early exxageration increases the high-dimensional probabilities for tSNE.

In [12]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/bandwidth.png', width=2000)
Out[12]:
In [30]:
from umap import umap_
plt.figure(figsize = (20, 15))

my_n_neighbors = []; my_sigma_umap = []
for n_neighbors in range(3, X_train.shape[0], 200):
    sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors, 
                                                   local_connectivity = 1, bandwidth = 200)
    my_sigma_umap.append(np.mean(sigmas_umap))
    my_n_neighbors.append(n_neighbors)
    print("N_neighbor = {0}, Mean Sigma = {1}".format(n_neighbors, np.mean(sigmas_umap)))

plt.plot(my_n_neighbors, my_sigma_umap, '-o')
plt.title("Sigma vs. N_neighbors: UMAP implementation of Leland McInnes", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
N_neighbor = 3, Mean Sigma = 0.0005434110795103611
N_neighbor = 203, Mean Sigma = 0.029689674540503756
N_neighbor = 403, Mean Sigma = 0.03986285083839624
N_neighbor = 603, Mean Sigma = 0.047277319079436254
N_neighbor = 803, Mean Sigma = 0.053450711169523116
N_neighbor = 1003, Mean Sigma = 0.05890858200760852
N_neighbor = 1203, Mean Sigma = 0.06389338874205458
N_neighbor = 1403, Mean Sigma = 0.06854460156423178
N_neighbor = 1603, Mean Sigma = 0.0729488659037416
N_neighbor = 1803, Mean Sigma = 0.07716326602196168
N_neighbor = 2003, Mean Sigma = 0.08122730945554014
N_neighbor = 2203, Mean Sigma = 0.08517411317425996
N_neighbor = 2403, Mean Sigma = 0.08902563587936704
N_neighbor = 2603, Mean Sigma = 0.09279683077752168
N_neighbor = 2803, Mean Sigma = 0.09650180422602403
N_neighbor = 3003, Mean Sigma = 0.10015113465151432

Now we can plot mean sigma vs. perplexity (tSNE) and n_neighbors (UMAP) against each other. Here we need to take into account the catch that we have first power of sigma in the equation for high-dimensional probability for UMAP and second power of sigma, i.e. $2\sigma^2$, in the denomenator of the power of the exponent in the equation for high-dimensional probability for tSNE. Since the goal of this analysis is to compara how much denomenator of the power of the exponent is affected by increasing the number of nearest neighbors in the neighborhood graph (tSNE, UMAP), we need to compare $2\sigma^2$, which is tSNE denomenator of the power in the exponent, again $\sigma$ for UMAP.

In [33]:
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
In [36]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')

plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

Increaseing the bandwidth yperparameter helped to put UMAP's and tSNE's mean sigma on the same scale for the World Map data set, however the main conclusion about logarithmic grows of the mean sigma as a function of n_neighbor remains the same. I.e. UMAP's mean sigma is less sensitive towards increasing n_neighbor than tSNE's mean sigma towards perplexity. UMAP's mean sigma does not diverge hyperbolically when n_neighbors approaches the samples sixe of the data set, in contrast to tSNE's mean sigma that jumps to infinity at perplexity equal to the sample size.

Let us now see how the above mean sigma vs. n_neighbor / perplexity dependence looks for the World Map embedded into the Swiss Roll 3D non-linear manifold.

In [5]:
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)

X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
Out[5]:
(3023, 3)
In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

import warnings
warnings.filterwarnings("ignore")

X_train = X_swiss_roll
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)

dist = np.square(euclidean_distances(X_train, X_train))

plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(3, X_train.shape[0], 200):
    
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERPLEXITY)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
    
    my_perp.append(PERPLEXITY)
    my_sigma_tSNE.append(np.mean(sigma_array))
    
plt.figure(figsize = (20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("tSNE: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 3) (3023,)
Perplexity = 3, Mean Sigma = 0.005075642102765517
Perplexity = 203, Mean Sigma = 0.07065486876411703
Perplexity = 403, Mean Sigma = 0.11879980583834482
Perplexity = 603, Mean Sigma = 0.16706462258185292
Perplexity = 803, Mean Sigma = 0.21617364457534552
Perplexity = 1003, Mean Sigma = 0.2646946252437368
Perplexity = 1203, Mean Sigma = 0.30373943352833355
Perplexity = 1403, Mean Sigma = 0.3379745432922787
Perplexity = 1603, Mean Sigma = 0.3708556437389842
Perplexity = 1803, Mean Sigma = 0.4045866279463243
Perplexity = 2003, Mean Sigma = 0.44117264111941906
Perplexity = 2203, Mean Sigma = 0.4833387374246984
Perplexity = 2403, Mean Sigma = 0.5356788319781093
Perplexity = 2603, Mean Sigma = 0.6086218250614107
Perplexity = 2803, Mean Sigma = 0.7365359559660118
Perplexity = 3003, Mean Sigma = 1.3994935566015367

Now let us see how mean sigma vs. n_neighbors looks like for UMAP performing on the Swiss Roll embedded 3D data set:

In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

import warnings
warnings.filterwarnings("ignore")

X_train = X_swiss_roll
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)

BANDWIDTH = 200

dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]

plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)

def k(prob):
    return np.power(2, np.sum(prob) / BANDWIDTH)

def sigma_binary_search(k_of_sigma, fixed_k):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 200):

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))
        
plt.figure(figsize = (20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("UMAP: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("UMAP: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 3) (3023,)
N_neighbor = 3, Mean Sigma = 0.06032534241873501
N_neighbor = 203, Mean Sigma = 0.7439154960286566
N_neighbor = 403, Mean Sigma = 0.9513407640018646
N_neighbor = 603, Mean Sigma = 1.1034586001341602
N_neighbor = 803, Mean Sigma = 1.2302408145825041
N_neighbor = 1003, Mean Sigma = 1.34216930706139
N_neighbor = 1203, Mean Sigma = 1.4442657738224072
N_neighbor = 1403, Mean Sigma = 1.539328458724809
N_neighbor = 1603, Mean Sigma = 1.629156818699056
N_neighbor = 1803, Mean Sigma = 1.7149364005797674
N_neighbor = 2003, Mean Sigma = 1.7975246594751113
N_neighbor = 2203, Mean Sigma = 1.8775184699489784
N_neighbor = 2403, Mean Sigma = 1.9554314217423903
N_neighbor = 2603, Mean Sigma = 2.0316061183312377
N_neighbor = 2803, Mean Sigma = 2.1063788537351336
N_neighbor = 3003, Mean Sigma = 2.179955316228737
In [10]:
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
In [11]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')

plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

Again we see similar things for 3D data as for the 2D, the non-linear embedding into the Swiss Roll did not seem to change much in sense of mean sigma vs. perplexity / n_neighbor dependence. We had to use the bandwidth = 200 hyperparameter in order to be able to compare mean sigma for tSNE with mean sigma for UMAP on a similar scale.

Looking at Gradients and Compare them with Initialization

Here we will try to understand the limit of high perplexity for tSNE and n_neighbors for UMAP. More specifically we will invistigate how much gradients in the gradient descent influence the initialization coordinates. For this purpose we again are going to use the syntetic World's Map data set where we know the ground truth, i.e. how far from each other and in which order continents are clocated on the map. We embed the World's Map into 3D non-linear manifold which is the Swiss Roll and compare tSNE vs. UMAP for the quality of the original data reconstruction.

In [2]:
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader

shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
                                        category='cultural', name=shapename)

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('NorthAmerica.png')
plt.close()
        
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela', 
                                           'Columbia', 'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('SouthAmerica.png')
plt.close()

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Australia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia', 
                                          'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden', 
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway', 
                                           'Italy', 'Iran', 'Pakistan', 'Afganistan', 'Iraq', 'Bulgaria', 
                                           'Romania', 'Turkmenistan', 'Uzbekistan' 'Austria', 'Ireland', 
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Eurasia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Marocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia', 
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic', 
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia', 
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa', 
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 
                                           'Nigeria', 'Cameroon']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Africa.png')
plt.close()


rng = np.random.RandomState(123)
plt.figure(figsize = (20,15))

N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)

N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)

N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)

N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)

N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)

plt.show()
In [3]:
X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)
(3023, 2)
(3023,)

Here we create the 3D Swiss Roll embedding of the World's Map data points. The internal dimensionality of the data is still 2D, therefore a good dimension reduction algorithms should be able to reconstruct the original World's Map.

In [4]:
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)

X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
Out[4]:
(3023, 3)

Let us test how well tSNE can reconstruct the original 2D World's Map embedded into the 3D Swiss Roll. For this purpose, we will be gradually increasing the Perplexity value in order to check the hypothesis that tSNE is capable of reconstructing original data at large enough perplexities.

In [16]:
import matplotlib
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore")

figure = plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})

X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)

for perplexity_index, perplexity in enumerate([50, 500, 1000, 2000]):
    print('Performing tSNE for Perplexity = {}'.format(perplexity))
    model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = perplexity, 
                 init = X_swiss_roll_reduced, n_iter = 1000, verbose = 0)
    tsne = model.fit_transform(X_swiss_roll)

    plt.subplot(221 + perplexity_index)
    plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
    plt.title('tSNE: Perplexity = {}'.format(perplexity), fontsize = 20) 
    plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)

figure.tight_layout()
plt.show()
Performing tSNE for Perplexity = 50
Performing tSNE for Perplexity = 500
Performing tSNE for Perplexity = 1000
Performing tSNE for Perplexity = 2000

Despite the hypothesis, we can see that increasing perplexity values actually decrease the quality of the original data reconstruction. At perplexity = 50, the World's Map looks distorted but fair enough, while at perplexities = 500 and 1000, the World's Map becomes unreasonable elongated and the order of the continents is not preserved any more. The most astonishing picture we observe at perplexity = 2000, here the World Map look like an archimedian spiral, i.e. very similar to the PCA reconstruction. The algorithm obviously has problems with convergence, however even increasing learning rate and number of iteration did not help at all, you are welcome to check it. Here we confirm our old suspicion that gradients in the tSNE algorithm become close to zero at large perplexity, so the algorithm does not really improve the original PCA initialization. So if one runs tSNE with a PCA initialization and increases the perplexity, one ends up with nothing else than PCA. Let us check if UMAP can do it better at large n_neighbor hyperparameter values.

In [12]:
import matplotlib
from umap import UMAP
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore")

figure = plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})

X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)

for n_neighbors_index, n_neighbors in enumerate([50, 500, 1000, 2000]):
    print('Performing UMAP for n_neighbors = {}'.format(n_neighbors))
    model = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = n_neighbors, 
                 init = X_swiss_roll_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model.fit_transform(X_swiss_roll)

    plt.subplot(221 + n_neighbors_index)
    plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
    plt.title('UMAP: n_neighbors = {}'.format(n_neighbors), fontsize = 20) 
    plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)

figure.tight_layout()
plt.show()
Performing UMAP for n_neighbours = 50
Performing UMAP for n_neighbours = 500
Performing UMAP for n_neighbours = 1000
Performing UMAP for n_neighbours = 2000

Here we can see that UMAP is not particularly sensitive to the n_neighbor hyperparameter value, the UMAP visulaizations for n_neighbor = 50, 500, 1000 and 2000 are fairly comparable. This again confirms what we have learnt about sigma(n_neighbor) logarithmic dependence for UMAP, i.e. sigma does not reach large values even if one dramatically increases n_neighbor.

Let us plot how gradient of tSNE changes with increasing perplexity and prove that it becomes negligible compared to the initialization contribution into the gradient descent rule. We will use only gradients at the beginning of optimization, as this is where major changes happen, for simplicity we select a gradient after 50 iterations.

In [42]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 750, 
             init = X_swiss_roll_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 20); plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing 2251 nearest neighbors...
[t-SNE] Indexed 3023 samples in 0.000s...
[t-SNE] Computed neighbors for 3023 samples in 0.791s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.272962
[t-SNE] Computed conditional probabilities in 3.360s
[t-SNE] Iteration 50: error = 40.9814262, gradient norm = 0.0040186 (50 iterations in 3.489s)
[t-SNE] Iteration 100: error = 40.9009171, gradient norm = 0.0004998 (50 iterations in 3.425s)
[t-SNE] Iteration 150: error = 40.8984680, gradient norm = 0.0006248 (50 iterations in 4.150s)
[t-SNE] Iteration 200: error = 40.9006653, gradient norm = 0.0006895 (50 iterations in 4.122s)
[t-SNE] Iteration 250: error = 40.8998070, gradient norm = 0.0007380 (50 iterations in 4.244s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 40.899807
[t-SNE] Iteration 300: error = 0.0956131, gradient norm = 0.0003993 (50 iterations in 3.797s)
[t-SNE] Iteration 350: error = 0.0852959, gradient norm = 0.0000912 (50 iterations in 4.607s)
[t-SNE] Iteration 400: error = 0.0836744, gradient norm = 0.0000200 (50 iterations in 3.807s)
[t-SNE] Iteration 450: error = 0.0834190, gradient norm = 0.0000158 (50 iterations in 4.917s)
[t-SNE] Iteration 500: error = 0.0833934, gradient norm = 0.0000169 (50 iterations in 3.561s)
[t-SNE] Iteration 550: error = 0.0834068, gradient norm = 0.0000131 (50 iterations in 4.393s)
[t-SNE] Iteration 600: error = 0.0835572, gradient norm = 0.0000156 (50 iterations in 3.478s)
[t-SNE] Iteration 650: error = 0.0834723, gradient norm = 0.0000132 (50 iterations in 4.412s)
[t-SNE] Iteration 700: error = 0.0833502, gradient norm = 0.0000120 (50 iterations in 3.516s)
[t-SNE] Iteration 750: error = 0.0832226, gradient norm = 0.0000146 (50 iterations in 4.467s)
[t-SNE] Iteration 800: error = 0.0831838, gradient norm = 0.0000152 (50 iterations in 3.544s)
[t-SNE] Iteration 850: error = 0.0830954, gradient norm = 0.0000145 (50 iterations in 4.451s)
[t-SNE] Iteration 900: error = 0.0829170, gradient norm = 0.0000143 (50 iterations in 3.544s)
[t-SNE] Iteration 950: error = 0.0829316, gradient norm = 0.0000151 (50 iterations in 4.677s)
[t-SNE] Iteration 1000: error = 0.0828066, gradient norm = 0.0000176 (50 iterations in 4.033s)
[t-SNE] KL divergence after 1000 iterations: 0.082807
In [43]:
tsne_perplexity = [3, 10, 20, 30, 50, 80, 100, 200, 500, 750, 1000, 1500, 2000]
tsne_perplexity
Out[43]:
[3, 10, 20, 30, 50, 80, 100, 200, 500, 750, 1000, 1500, 2000]
In [44]:
tsne_gradient = [0.0785314, 0.0414363, 0.0267256, 0.0240762, 0.0189936, 0.0184264, 0.0127485, 0.0105595, 
                0.0070886, 0.0040186, 0.0004717, 0.0000006, 0.0000004]
tsne_gradient
Out[44]:
[0.0785314,
 0.0414363,
 0.0267256,
 0.0240762,
 0.0189936,
 0.0184264,
 0.0127485,
 0.0105595,
 0.0070886,
 0.0040186,
 0.0004717,
 6e-07,
 4e-07]
In [45]:
plt.figure(figsize = (20,15))
plt.plot(tsne_perplexity, tsne_gradient, "-o")
plt.title("tSNE Gradient vs. Perplexity", fontsize = 22)
plt.xlabel("Perplexity", fontsize = 22); plt.ylabel("tSNE Gradient", fontsize = 22)
plt.show()
In [ ]:
 
In [ ]:
 
In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

N_LOW_DIMS = 2
MAX_ITER = 200
PERPLEXITY = 100
LEARNING_RATE = 0.1

path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]; X_train = np.log(X_train + 1); n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
print('\n')

dist = np.square(euclidean_distances(X_train, X_train))
X_reduced = PCA(n_components = 2).fit_transform(X_train)

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 100 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))

P = prob + np.transpose(prob)

def prob_low_dim(Y):
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    np.fill_diagonal(inv_distances, 0.)
    return inv_distances / np.sum(inv_distances, axis = 1, keepdims = True)

def KL(P, Y):
    Q = prob_low_dim(Y)
    return P * np.log(P + 0.01) - P * np.log(Q + 0.01)

def KL_gradient(P, Y):
    Q = prob_low_dim(Y)
    y_diff = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
    inv_dist = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    return 4*np.sum(np.expand_dims(P - Q, 2) * y_diff * np.expand_dims(inv_dist, 2), axis = 1)

np.random.seed(12345)
#y = np.random.normal(loc = 0, scale = 1, size = (n, N_LOW_DIMS))
y = X_reduced
KL_array = []; KL_gradient_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
    y = y - LEARNING_RATE * KL_gradient(P, y)
    KL_array.append(np.sum(KL(P, y)))
    KL_gradient_array.append(np.sum(KL_gradient(P, y)))
    if i % 100 == 0:
        print("KL divergence = " + str(np.sum(KL(P, y))))
        
plt.figure(figsize=(20,15))
plt.plot(KL_array,'-o')
plt.title("KL-divergence", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
plt.plot(KL_gradient_array,'-o')
plt.title("KL-divergence Gradient", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE GRADIENT", fontsize = 20)
plt.show()
                1110020A21Rik  1110046J04Rik  1190002F15Rik  1500015A07Rik
SS2_15_0048_A3            0.0            0.0            0.0            0.0
SS2_15_0048_A6            0.0            0.0            0.0            0.0
SS2_15_0048_A5            0.0            0.0            0.0            0.0
SS2_15_0048_A4            0.0            0.0            0.0            0.0

This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)


Sigma binary search finished 100 of 716 cells
Sigma binary search finished 200 of 716 cells
Sigma binary search finished 300 of 716 cells
Sigma binary search finished 400 of 716 cells
Sigma binary search finished 500 of 716 cells
Sigma binary search finished 600 of 716 cells
Sigma binary search finished 700 of 716 cells

Mean sigma = 8.084936514913037
Running Gradient Descent: 

KL divergence = 1228.1195036162594
KL divergence = 1130.2023257098751
In [ ]:
 
In [ ]:
 

Making SciKitLearn tSNE Agree with My Implementation of tSNE

I am a bit worried that the mean sigma reported by the scikitlearn implementation of tSNE is sligthly higher than the one determined in my implementation. To understand whether I have a bug in my code, I will dig into the scikitlearn and Rtsne codes. First let us run my implementation on Cancer Associated Fibroblasts (CAFs) scRNAseq data set:

In [2]:
import numpy as np; import pandas as pd; import seaborn as sns
from sklearn.metrics.pairwise import euclidean_distances; import matplotlib.pyplot as plt

path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]; X_train = np.log(X_train + 1); n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)

dist = np.square(euclidean_distances(X_train, X_train))

def prob_high_dim(sigma, dist_row):
        exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
        exp_distance[dist_row] = 0
        prob_not_symmetr = exp_distance / np.sum(exp_distance)
        return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

PERPLEXITY = 30
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
        
print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
                1110020A21Rik  1110046J04Rik  1190002F15Rik  1500015A07Rik
SS2_15_0048_A3            0.0            0.0            0.0            0.0
SS2_15_0048_A6            0.0            0.0            0.0            0.0
SS2_15_0048_A5            0.0            0.0            0.0            0.0
SS2_15_0048_A4            0.0            0.0            0.0            0.0

This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)
Perplexity = 30, Mean Sigma = 6.374859943070225

As we can see, the mean sigma parameter is equal to 6.37 for Perplexity = 30. However, when we run the scikitlearn implementation of tSNE we get mean sigma equal to 7.54 for the same preplexity, see below:

In [16]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X_train)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 30, 
             init = X_reduced, n_iter = 1000, method = 'exact', verbose = 2)
tsne = model.fit_transform(X_train)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 716 / 716
[t-SNE] Mean sigma: 7.540113
[t-SNE] Iteration 50: error = 79.5732811, gradient norm = 0.3724284 (50 iterations in 0.635s)
[t-SNE] Iteration 100: error = 81.5138310, gradient norm = 0.3666290 (50 iterations in 0.664s)
[t-SNE] Iteration 150: error = 82.5920636, gradient norm = 0.3353633 (50 iterations in 0.685s)
[t-SNE] Iteration 200: error = 81.7540246, gradient norm = 0.3565094 (50 iterations in 0.692s)
[t-SNE] Iteration 250: error = 83.2222226, gradient norm = 0.3301316 (50 iterations in 0.681s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 83.222223
[t-SNE] Iteration 300: error = 1.7064036, gradient norm = 0.0037092 (50 iterations in 0.668s)
[t-SNE] Iteration 350: error = 1.4779231, gradient norm = 0.0017475 (50 iterations in 0.664s)
[t-SNE] Iteration 400: error = 1.3499565, gradient norm = 0.0018632 (50 iterations in 0.666s)
[t-SNE] Iteration 450: error = 1.3239029, gradient norm = 0.0004190 (50 iterations in 0.663s)
[t-SNE] Iteration 500: error = 1.3149131, gradient norm = 0.0003839 (50 iterations in 0.665s)
[t-SNE] Iteration 550: error = 1.3070469, gradient norm = 0.0003388 (50 iterations in 0.663s)
[t-SNE] Iteration 600: error = 1.3024996, gradient norm = 0.0001431 (50 iterations in 0.682s)
[t-SNE] Iteration 650: error = 1.2944050, gradient norm = 0.0022338 (50 iterations in 0.663s)
[t-SNE] Iteration 700: error = 1.2901842, gradient norm = 0.0000879 (50 iterations in 0.663s)
[t-SNE] Iteration 750: error = 1.2884903, gradient norm = 0.0001462 (50 iterations in 0.657s)
[t-SNE] Iteration 800: error = 1.2864119, gradient norm = 0.0003161 (50 iterations in 0.658s)
[t-SNE] Iteration 850: error = 1.2854452, gradient norm = 0.0001276 (50 iterations in 0.656s)
[t-SNE] Iteration 900: error = 1.2821248, gradient norm = 0.0003114 (50 iterations in 0.656s)
[t-SNE] Iteration 950: error = 1.2788269, gradient norm = 0.0001790 (50 iterations in 0.657s)
[t-SNE] Iteration 1000: error = 1.2783007, gradient norm = 0.0000431 (50 iterations in 0.675s)
[t-SNE] KL divergence after 1000 iterations: 1.278301

This can be for eaxample because I calculate Euclidean distances somehow in a wrong way, let us see how scikitlearn calculates pairwise Euclidean distances:

In [4]:
from sklearn.metrics.pairwise import pairwise_distances
dist_X = pairwise_distances(X_train, metric = 'euclidean', squared = True)
dist_X
Out[4]:
array([[   0.        ,  914.95016311, 1477.46836099, ..., 1058.56614432,
        3328.59582444, 1478.18928181],
       [ 914.95016311,    0.        , 1307.39294642, ...,  867.05197552,
        3044.97834743, 1376.96312866],
       [1477.46836099, 1307.39294642,    0.        , ..., 1400.74705041,
        2986.18951734, 1308.08666642],
       ...,
       [1058.56614432,  867.05197552, 1400.74705041, ...,    0.        ,
        2849.67875141, 1253.93197644],
       [3328.59582444, 3044.97834743, 2986.18951734, ..., 2849.67875141,
           0.        , 2748.92021714],
       [1478.18928181, 1376.96312866, 1308.08666642, ..., 1253.93197644,
        2748.92021714,    0.        ]])

And now compare with the pairwise Euclidean distances computed in my code:

In [5]:
dist
Out[5]:
array([[   0.        ,  914.95016311, 1477.46836099, ..., 1058.56614432,
        3328.59582444, 1478.18928181],
       [ 914.95016311,    0.        , 1307.39294642, ...,  867.05197552,
        3044.97834743, 1376.96312866],
       [1477.46836099, 1307.39294642,    0.        , ..., 1400.74705041,
        2986.18951734, 1308.08666642],
       ...,
       [1058.56614432,  867.05197552, 1400.74705041, ...,    0.        ,
        2849.67875141, 1253.93197644],
       [3328.59582444, 3044.97834743, 2986.18951734, ..., 2849.67875141,
           0.        , 2748.92021714],
       [1478.18928181, 1376.96312866, 1308.08666642, ..., 1253.93197644,
        2748.92021714,    0.        ]])

The pairwise Euclidean distances look identical. So this is not where the discrepancy comes from. I figured out that scikitlearn calculates mean sigma via a binary_search_perplexity function from utils module. We can quickly reproduce the mean sigma equal to 7.54 without running the whole tSNE procedure:

In [6]:
from sklearn.manifold import _utils
conditional_P = _utils._binary_search_perplexity(np.float32(dist_X), desired_perplexity = 30, verbose = 2)
[t-SNE] Computed conditional probabilities for sample 716 / 716
[t-SNE] Mean sigma: 7.540113

I also looked at the codes of the binary_search_perplexity function from here and also checked the C++ implementation used by the Rtsne wrapper from here. There are a few interesting discrepancies from my implementation of computing mean sigma.

First of all, scikitlearn and Rtsne use beta-parameter instead of sigma, where $\beta_i = 1 / (2\sigma_i^2)$, therefore $\sigma_i = \sqrt{1/(2\beta_i)}$ and $<\sigma> = \sqrt{1/(2<\beta>)}$, where $<\beta>=(1/N)\sum_i{\beta_i}\equiv (1/N)\beta_{\rm{sum}}$. MAking simple derivation we obtain: $<\sigma> = \sqrt{1/(2<\beta>)}=\sqrt{N/(2\beta_{\rm{sum}})}$. However we see that the coefficient 2 is ignored in the scikitlearn codes below at the very end. Moreover, from the definition of entropy $H=\sum_i{p_{ij}\log_2{p_{ij}}}$, however the base 2 of the logarithm is again ignored in the scikitlearn implementation and math.log, i.e. base 2.71, is used instead. In addition math.log(PERPLEXITY) is used for computing desired entropy, i.e. again base 2.71 instead of 2, while $\rm{Perplexity}=2^{\rm{entropy}}$ by definition. In summary, we obsreve quite a few discrepancies of the scikitlearn implementation from the mathematical formulation of the tSNE from Laurens vad der Maaten and Jeoffrey Hinton's original paper, therefore it looks like my implementation of computing sigma values is actually mathematically closer to the original algorithm than the scikitlearn implementation.

In [19]:
import math
my_perp = []; my_sigma = []
for PERPLEXITY in range(3, X_train.shape[0], 20):
    n_steps = 100; desired_entropy = math.log(PERPLEXITY)
    sqdistances = np.float32(dist_X)
    n_samples = sqdistances.shape[0]; n_neighbors = sqdistances.shape[1]
    PERPLEXITY_TOLERANCE = 1e-5; EPSILON_DBL = 1e-8; NPY_INFINITY = 1000
    P = np.zeros((n_samples, n_neighbors), dtype=np.float64); beta_sum = 0.0

    for i in range(n_samples):
    
        beta_min = -NPY_INFINITY; beta_max = NPY_INFINITY; beta = 1.0
        for l in range(n_steps):
            sum_Pi = 0.0
            for j in range(n_neighbors):
                if j != i or n_neighbors < n_samples:
                    P[i, j] = math.exp(-sqdistances[i, j] * beta)
                    sum_Pi += P[i, j]

            if sum_Pi == 0.0:
                sum_Pi = EPSILON_DBL
            sum_disti_Pi = 0.0

            for j in range(n_neighbors):
                P[i, j] /= sum_Pi
                sum_disti_Pi += sqdistances[i, j] * P[i, j]

            entropy = math.log(sum_Pi) + beta * sum_disti_Pi
            entropy_diff = entropy - desired_entropy

            if math.fabs(entropy_diff) <= PERPLEXITY_TOLERANCE:
                break

            if entropy_diff > 0.0:
                beta_min = beta
                if beta_max == NPY_INFINITY:
                    beta *= 2.0
                else:
                    beta = (beta + beta_max) / 2.0
            else:
                beta_max = beta
                if beta_min == -NPY_INFINITY:
                    beta /= 2.0
                else:
                    beta = (beta + beta_min) / 2.0

        beta_sum += beta

    my_perp.append(PERPLEXITY)
    my_sigma.append(np.mean(math.sqrt(n_samples / beta_sum)))
    print("Perplexity = {0}, Mean sigma: {1}".format(PERPLEXITY, np.mean(math.sqrt(n_samples / beta_sum))))
Perplexity = 3, Mean sigma: 4.651147004547775
Perplexity = 23, Mean sigma: 7.878144228295371
Perplexity = 43, Mean sigma: 8.832406082618935
Perplexity = 63, Mean sigma: 9.511747533639586
Perplexity = 83, Mean sigma: 10.077509424297272
Perplexity = 103, Mean sigma: 10.581021206265143
Perplexity = 123, Mean sigma: 11.04812020089174
Perplexity = 143, Mean sigma: 11.495192488673874
Perplexity = 163, Mean sigma: 11.933238154727839
Perplexity = 183, Mean sigma: 12.37012423974057
Perplexity = 203, Mean sigma: 12.81199926282527
Perplexity = 223, Mean sigma: 13.264106247170202
Perplexity = 243, Mean sigma: 13.731281576559637
Perplexity = 263, Mean sigma: 14.218279146146946
Perplexity = 283, Mean sigma: 14.730006300262993
Perplexity = 303, Mean sigma: 15.271664526985438
Perplexity = 323, Mean sigma: 15.848901101696448
Perplexity = 343, Mean sigma: 16.467749809473982
Perplexity = 363, Mean sigma: 17.13454200609421
Perplexity = 383, Mean sigma: 17.855525318658728
Perplexity = 403, Mean sigma: 18.636260085986837
Perplexity = 423, Mean sigma: 19.480829598837406
Perplexity = 443, Mean sigma: 20.391435913195984
Perplexity = 463, Mean sigma: 21.368716435989196
Perplexity = 483, Mean sigma: 22.413603698861568
Perplexity = 503, Mean sigma: 23.52987050285552
Perplexity = 523, Mean sigma: 24.7270144538872
Perplexity = 543, Mean sigma: 26.022583152711327
Perplexity = 563, Mean sigma: 27.445035808768086
Perplexity = 583, Mean sigma: 29.038796216859488
Perplexity = 603, Mean sigma: 30.873725971144566
Perplexity = 623, Mean sigma: 33.0657561193046
Perplexity = 643, Mean sigma: 35.82562322506101
Perplexity = 663, Mean sigma: 39.596115644333395
Perplexity = 683, Mean sigma: 45.5719512561863
Perplexity = 703, Mean sigma: 59.50498233161318
In [23]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.plot(my_perp, my_sigma, '-o')

plt.gca().legend(('tSNE My Implementation','tSNE Scikitlearn Implementation'), fontsize = 20)
plt.title("Sigma vs. Perplexity for tSNE", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

Comparing how scikitlearn mean sigma and mean sigma computed in my tSNE implementation vary vs. Perplexity, we conclude that they behave very similarly quantitatively while the absolute values of the scikitlearn implementation are systematically a bit larger than the ones from my implementation of the tSNE algorithm.

Checking that Sigma from Leland's Implementation of UMAP Agrees with Mine

Since mean sigma is not really reported by the original implementation of UMAP from Leland McInnes, I want to dig into UMAP codes and extract mean sigma in order to compare it with the mean sigma from my implementation of UMAP. Looking at the umap_.py script, I found a function smooth_knn_dist that seems to perform binary search and return arrays of sigma and rho (local connectivity parameter). The input of the function is the number of nearest neighbors k = n_neighbor for each point.

In [56]:
from umap import umap_
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = 716, bandwidth = 20)
np.mean(sigmas_umap)
Out[56]:
60.742850280143934

Playing with the parameters of the smooth_knn_dist function I noticed that the mean sigma returned does not change that much when I vary n_neighbor. However, I discovered another parameter bandwidth that seems to have a dramatic effect on the mean sigma. Here I demonstrate that increasing bandwidth, we can in principle make mean sigma go to infinity.

In [75]:
from umap import umap_
plt.figure(figsize=(20, 15))

my_bandwidth_n_neighbors = []
my_bandwidth_sigma_umap = []
for bandwidth in [1, 5, 10, 20]:
    my_n_neighbors = []; my_sigma_umap = []
    for n_neighbors in range(3, X_train.shape[0], 20):
        sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors, bandwidth = bandwidth)
        my_sigma_umap.append(np.mean(sigmas_umap))
        my_n_neighbors.append(n_neighbors)
        #print("UMAP N_neighbors = {0}, UMAP Mean sigma: {1}".format(n_neighbors, np.mean(sigmas_umap)))
    my_bandwidth_sigma_umap.append(my_sigma_umap)
    my_bandwidth_n_neighbors.append(my_n_neighbors)
    print('Finished computing mean sigma for bandwidth = {}'.format(bandwidth))
    plt.plot(my_n_neighbors, my_sigma_umap, '-o')

plt.gca().legend(['Bandwidth = {}'.format(bandwidth) for bandwidth in [1, 5, 10, 20]], fontsize = 20)
plt.title("Sigma vs. N_neighbors for UMAP", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Finished computing mean sigma for bandwidth = 1
Finished computing mean sigma for bandwidth = 5
Finished computing mean sigma for bandwidth = 10
Finished computing mean sigma for bandwidth = 20

The bandwidth parameter is connected with entropy and n_neighbors as:

$$\rm{n_{neighbors}} = 2^{\displaystyle\rm{entropy \,/\, bandwidth}}$$

Without the bandwidth parameter (or when bandwidth = 1), the perplexity or number of nearest neighbors would be given by:

$$\rm{n_{neighbors}} = 2^{\displaystyle\rm{entropy}} = 2^{\displaystyle\sum_j{p_{ij}}}$$

Therefore, the bandwidth parameter effectively increases the entropy or serves as a factor that multiplies the values of $\log_2{\rm{n_{neighbor}}}$ by bandwidth and hence raises the effective value of n_neighbor (aka perplexity) parameter.

Now, looking more carefully, we can see that the smooth_knn_dist function has another very important parameter local_connectivity that seems to affect the mean sigma a lot. Let us demonstrate it:

In [106]:
from umap import umap_
plt.figure(figsize=(20, 15))

my_local_connectivity_n_neighbors = []
my_local_connectivity_sigma_umap = []
for local_connectivity in [0, 0.1, 0.5, 1]:
    my_n_neighbors = []; my_sigma_umap = []
    for n_neighbors in range(3, X_train.shape[0], 20):
        sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors, 
                                                       local_connectivity = local_connectivity)
        my_sigma_umap.append(np.mean(sigmas_umap))
        my_n_neighbors.append(n_neighbors)

    my_local_connectivity_sigma_umap.append(my_sigma_umap)
    my_local_connectivity_n_neighbors.append(my_n_neighbors)
    print('Finished computing mean sigma for local_connectivity = {}'.format(local_connectivity))
    plt.plot(my_n_neighbors, my_sigma_umap, '-o')

plt.gca().legend(['Local_connectivity = {}'.format(local_connectivity) 
                  for local_connectivity in [0, 0.1, 0.5, 1]], fontsize = 20)
plt.title("Sigma vs. N_neighbors for UMAP", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Finished computing mean sigma for local_connectivity = 0
Finished computing mean sigma for local_connectivity = 0.1
Finished computing mean sigma for local_connectivity = 0.5
Finished computing mean sigma for local_connectivity = 1

It looks like at local_connectivity = 1, indeed, the mean sigma parameter does not change much as n_neighbor increases. However, when local connectivity decreases, mean sigma can jump up to 100 or even 400, it plateaus quite quickly, which agrees with the behavior from my implementation of UMAP.

Now let us understand why at bandwidth = 1 and local_connectivity = 1, that are the default UMAP parameters in Leland McInnes implementation, the mean sigma from Leland's implementation varies very little with n_neighbor, while in my implementation it varies a lot, let us check it:

In [171]:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np; import pandas as pd; import seaborn as sns; import matplotlib.pyplot as plt

dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]

def prob_high_dim(sigma, dist_row):
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)

def k(prob):
    return np.power(2, np.sum(prob))

def sigma_binary_search(k_of_sigma, fixed_k):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 20):

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))

plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
N_neighbor = 3, Mean Sigma = 0.00095367431640625
N_neighbor = 23, Mean Sigma = 72.26585143105278
N_neighbor = 43, Mean Sigma = 82.91104119583215
N_neighbor = 63, Mean Sigma = 88.53755450115524
N_neighbor = 83, Mean Sigma = 92.28989531873991
N_neighbor = 103, Mean Sigma = 95.07648372117367
N_neighbor = 123, Mean Sigma = 97.27877611554534
N_neighbor = 143, Mean Sigma = 99.0917163188231
N_neighbor = 163, Mean Sigma = 100.62791781718505
N_neighbor = 183, Mean Sigma = 101.95771009562402
N_neighbor = 203, Mean Sigma = 103.12805229059144
N_neighbor = 223, Mean Sigma = 104.17187280495074
N_neighbor = 243, Mean Sigma = 105.11270981261184
N_neighbor = 263, Mean Sigma = 105.96843005558632
N_neighbor = 283, Mean Sigma = 106.75261939704085
N_neighbor = 303, Mean Sigma = 107.47582419624541
N_neighbor = 323, Mean Sigma = 108.14648633562653
N_neighbor = 343, Mean Sigma = 108.77148798724127
N_neighbor = 363, Mean Sigma = 109.35646195651432
N_neighbor = 383, Mean Sigma = 109.90591422139599
N_neighbor = 403, Mean Sigma = 110.42387392267834
N_neighbor = 423, Mean Sigma = 110.91354306183713
N_neighbor = 443, Mean Sigma = 111.37780930076897
N_neighbor = 463, Mean Sigma = 111.81909678368595
N_neighbor = 483, Mean Sigma = 112.23934748985248
N_neighbor = 503, Mean Sigma = 112.64057265979618
N_neighbor = 523, Mean Sigma = 113.02424542730746
N_neighbor = 543, Mean Sigma = 113.39183359838731
N_neighbor = 563, Mean Sigma = 113.74458920356281
N_neighbor = 583, Mean Sigma = 114.08359112020311
N_neighbor = 603, Mean Sigma = 114.40984230467727
N_neighbor = 623, Mean Sigma = 114.72423649367008
N_neighbor = 643, Mean Sigma = 115.0276034903926
N_neighbor = 663, Mean Sigma = 115.32060527268735
N_neighbor = 683, Mean Sigma = 115.60397973939693
N_neighbor = 703, Mean Sigma = 115.87817708873216

One thing that we can see immediately comparing my binary search implementation with Leland's one is that I seem to compute parameter rho very differently from the way Leland computes it. Indeed, the distribution of rho values from my implementation is unimodal, while from Leland's implementation it is bimodal, i.e. resembles actually the bimodal shape of the distribution of distances.

In [76]:
dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
In [79]:
plt.figure(figsize=(20,15))
sns.distplot(rho)
plt.show()
In [108]:
from umap import umap_
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = 30, bandwidth = 1, local_connectivity = 1)
plt.figure(figsize=(20,15))
sns.distplot(rhos_umap)
plt.show()

So let us further dig into Leland's codes and understand how he uses the localconnectivity hyperparameter for computing the rho values. Looking at the codes [here](https://github.com/lmcinnes/umap/blob/master/umap/umap.py) we can see that the local_connectivity parameter is aka the number of nearest neighbors, i.e. only firt nearest neighbor or first and second nearest neighbors, to subtruct from the rest of the distances for each data point. However, when determining what point is the nearest neighbor for each data point, Leland McInnes does not order or sort all the points by their distances, but takes only the first one in the order they appear in the distance matrix. Here is the piece of code from umap_.py that shows no sorting / ordering of the distances, the parameter rho for each data point gets zero's element of distnace matrx, this is because index-1 = 0 (since index = local_connectivity = 1), see below:

In [174]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/UMAP_no_sorting.png', width=2000)
Out[174]:

Therefore, if the distance matrix was not pre-sorted, Leland's procedure seems to be very strange since my understanding from the original UMAP paper was that the parameter rho reflects the distance from each data point to its first nearest neighbor. Therefore some sorting must be used to dtermine what is the nearest neighbor for each data point. Let us demonstrate that if I do not sort neighbors in my UMAP implementation, I get identical to Lealnd's distribution of rho values.

In [164]:
rho = [dist[i][dist[i]>0][0] for i in range(dist.shape[0])]
In [165]:
plt.figure(figsize=(20,15))
sns.distplot(rho)
plt.show()
In [166]:
rho = np.zeros(dist.shape[0], dtype=np.float32)   
for i in range(dist.shape[0]):
    ith_distances = dist[i]
    non_zero_dists = ith_distances[ith_distances > 0.0]
    rho[i] = non_zero_dists[0]
In [167]:
plt.figure(figsize=(20,15))
sns.distplot(rho)
plt.show()

In addition, in the umap_.py codes there is some mysterious multiplication factor MIN_K_DIST_SCALE = 0.001 that perhaps is supposed to set a minimal value of sigma, which is the result variable in the code below, so if the sigma is too low, it is set to be the mean distance multiplied by MIN_K_DIST_SCALE:

In [176]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/UMAP_multiply_constant.png', width=2000)
Out[176]:

Let us omit sorting data points by their proximity to each data point in my UMAP implementation and compute the mean sigma vs. n_nighbor dependence:

In [177]:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np; import pandas as pd; import seaborn as sns; import matplotlib.pyplot as plt

dist = np.square(euclidean_distances(X_train, X_train))
mean_distances = np.mean(dist)
#rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
rho = [dist[i][dist[i]>0][0] for i in range(dist.shape[0])]

def prob_high_dim(sigma, dist_row):
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)

def k(prob):
    return np.power(2, np.sum(prob))

def sigma_binary_search(k_of_sigma, fixed_k):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 20):

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        mean_ith_distances = np.mean(dist[dist_row])
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        if binary_search_result < mean_ith_distances * 1e-3:
            binary_search_result = mean_ith_distances * 1e-3
        sigma_array.append(binary_search_result)
        
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))

plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
N_neighbor = 3, Mean Sigma = 2.2327677554763565
N_neighbor = 23, Mean Sigma = 2.2826117370662926
N_neighbor = 43, Mean Sigma = 2.3063927017386074
N_neighbor = 63, Mean Sigma = 2.319144765852118
N_neighbor = 83, Mean Sigma = 2.331129658330073
N_neighbor = 103, Mean Sigma = 2.340376036967632
N_neighbor = 123, Mean Sigma = 2.3477976477092737
N_neighbor = 143, Mean Sigma = 2.3538739916079403
N_neighbor = 163, Mean Sigma = 2.358959366663861
N_neighbor = 183, Mean Sigma = 2.3633041789824603
N_neighbor = 203, Mean Sigma = 2.3670869095112224
N_neighbor = 223, Mean Sigma = 2.3704247696186442
N_neighbor = 243, Mean Sigma = 2.3734056678310984
N_neighbor = 263, Mean Sigma = 2.3828653029824163
N_neighbor = 283, Mean Sigma = 2.3907309391745777
N_neighbor = 303, Mean Sigma = 2.3972601451788003
N_neighbor = 323, Mean Sigma = 2.4028676436034795
N_neighbor = 343, Mean Sigma = 2.4079103963435826
N_neighbor = 363, Mean Sigma = 2.4125295898202546
N_neighbor = 383, Mean Sigma = 2.4168051408756512
N_neighbor = 403, Mean Sigma = 2.42077967182559
N_neighbor = 423, Mean Sigma = 2.424482485512195
N_neighbor = 443, Mean Sigma = 2.427942884777591
N_neighbor = 463, Mean Sigma = 2.431187508569163
N_neighbor = 483, Mean Sigma = 2.4342350041500818
N_neighbor = 503, Mean Sigma = 2.4371066826782544
N_neighbor = 523, Mean Sigma = 2.4448952920240745
N_neighbor = 543, Mean Sigma = 2.449410593605942
N_neighbor = 563, Mean Sigma = 2.453081440555685
N_neighbor = 583, Mean Sigma = 2.4563633588735967
N_neighbor = 603, Mean Sigma = 2.4593762238229138
N_neighbor = 623, Mean Sigma = 2.4621866327720996
N_neighbor = 643, Mean Sigma = 2.464829216352756
N_neighbor = 663, Mean Sigma = 2.46732794961753
N_neighbor = 683, Mean Sigma = 2.4697014798295913
N_neighbor = 703, Mean Sigma = 2.4719657903573715

The values of mean sigma that we obtain are very close to the ones from the original UMAP binary search implementation for default local_connectivity = 1 and bandwidth = 1:

In [178]:
from umap import umap_
plt.figure(figsize=(20, 15))

my_n_neighbors = []; my_sigma_umap = []
for n_neighbors in range(3, X_train.shape[0], 20):
    sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors)
    my_sigma_umap.append(np.mean(sigmas_umap))
    my_n_neighbors.append(n_neighbors)
    print("N_neighbor = {0}, Mean Sigma = {1}".format(n_neighbors, np.mean(sigmas_umap)))

plt.plot(my_n_neighbors, my_sigma_umap, '-o')
plt.title("Sigma vs. N_neighbors for UMAP", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
N_neighbor = 3, Mean Sigma = 2.2327677554763565
N_neighbor = 23, Mean Sigma = 2.3088340866774497
N_neighbor = 43, Mean Sigma = 2.3326607051921155
N_neighbor = 63, Mean Sigma = 2.3487845567764385
N_neighbor = 83, Mean Sigma = 2.3596527358419936
N_neighbor = 103, Mean Sigma = 2.3676132205225104
N_neighbor = 123, Mean Sigma = 2.3738273836793717
N_neighbor = 143, Mean Sigma = 2.3917966782148463
N_neighbor = 163, Mean Sigma = 2.403656934303359
N_neighbor = 183, Mean Sigma = 2.4131952118707907
N_neighbor = 203, Mean Sigma = 2.4213509067577497
N_neighbor = 223, Mean Sigma = 2.428441810950967
N_neighbor = 243, Mean Sigma = 2.434675580373104
N_neighbor = 263, Mean Sigma = 2.445680416392583
N_neighbor = 283, Mean Sigma = 2.4535954656844385
N_neighbor = 303, Mean Sigma = 2.459808435416457
N_neighbor = 323, Mean Sigma = 2.4652105582123713
N_neighbor = 343, Mean Sigma = 2.4700480205683424
N_neighbor = 363, Mean Sigma = 2.474447240273082
N_neighbor = 383, Mean Sigma = 2.4784889439927484
N_neighbor = 403, Mean Sigma = 2.4822300751412736
N_neighbor = 423, Mean Sigma = 2.485713597013
N_neighbor = 443, Mean Sigma = 2.488973181237424
N_neighbor = 463, Mean Sigma = 2.4920360208520362
N_neighbor = 483, Mean Sigma = 2.494924961418115
N_neighbor = 503, Mean Sigma = 2.4976573075958823
N_neighbor = 523, Mean Sigma = 2.5002496820885063
N_neighbor = 543, Mean Sigma = 2.503482000793514
N_neighbor = 563, Mean Sigma = 2.5064477467726953
N_neighbor = 583, Mean Sigma = 2.50919876152479
N_neighbor = 603, Mean Sigma = 2.5117867885412006
N_neighbor = 623, Mean Sigma = 2.514239361837945
N_neighbor = 643, Mean Sigma = 2.516574041809139
N_neighbor = 663, Mean Sigma = 2.5188042118619496
N_neighbor = 683, Mean Sigma = 2.5209412095323844
N_neighbor = 703, Mean Sigma = 2.5229923658587636

Therefore, we conclude that we managed to reproduce the original Leland McInnes implementation of the binary search for sigma parameter. The two main discrepancies between my and Leland's computations of sigma was the absence of sorting in Leland's code (which is again very-very strange), and presence of some sort of minimal sigma that did not allow very small sigmas (probably because of singularities in that case) and set each sigma close to zero to mean_sigma * 0.001.

In [179]:
from umap import umap_
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = 703)
np.mean(sigmas_umap)
Out[179]:
2.5229923658587636

Quantify How Well Dimension Reduction Algorithms Preserve Global Structure

Now let us try to quantify how well such dimension reduction algorithms as PCA / MDS, tSNE and UMAP are capable of preserving global data structure. By global data structure we mean here: 1) the distances between the clustesrs, 2) the correlation between original and transformed centroid coordinates, and 3) the shapes of the clusters. We are going to use the 2D world map syntetic data set, and by shapes of the clusters we mean the sizes of the bounding boxes drawn around of the continetn / cluster.

In [2]:
import cartopy
import numpy as np
import cartopy.crs as ccrs
from skimage.io import imread
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from matplotlib import pyplot as plt
import cartopy.io.shapereader as shpreader

import warnings
warnings.filterwarnings("ignore")

shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
                                        category='cultural', name=shapename)

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('NorthAmerica.png')
plt.close()
        
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela', 
                                           'Columbia', 'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('SouthAmerica.png')
plt.close()

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Australia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia', 
                                          'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden', 
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway', 
                                           'Italy', 'Iran', 'Pakistan', 'Afganistan', 'Iraq', 'Bulgaria', 
                                           'Romania', 'Turkmenistan', 'Uzbekistan' 'Austria', 'Ireland', 
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Eurasia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Marocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia', 
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic', 
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia', 
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa', 
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 
                                           'Nigeria', 'Cameroon']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Africa.png')
plt.close()


rng = np.random.RandomState(123)
fig = plt.figure(figsize=(20,15))

N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)

N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)

N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)

N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)

N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)

NorthAmerica_bb_coords = [np.min(X_NorthAmerica[:,0]), np.max(X_NorthAmerica[:,0]), 
                          np.min(X_NorthAmerica[:,1]), np.max(X_NorthAmerica[:,1])]
Eurasia_bb_coords = [np.min(X_Eurasia[:,0]), np.max(X_Eurasia[:,0]), 
                          np.min(X_Eurasia[:,1]), np.max(X_Eurasia[:,1])]
Africa_bb_coords = [np.min(X_Africa[:,0]), np.max(X_Africa[:,0]), 
                          np.min(X_Africa[:,1]), np.max(X_Africa[:,1])]
SouthAmerica_bb_coords = [np.min(X_SouthAmerica[:,0]), np.max(X_SouthAmerica[:,0]), 
                          np.min(X_SouthAmerica[:,1]), np.max(X_SouthAmerica[:,1])]
Australia_bb_coords = [np.min(X_Australia[:,0]), np.max(X_Australia[:,0]), 
                          np.min(X_Australia[:,1]), np.max(X_Australia[:,1])]

ax = fig.add_subplot(1, 1, 1)
rect1 = plt.Rectangle((NorthAmerica_bb_coords[0], NorthAmerica_bb_coords[2]), 
                      NorthAmerica_bb_coords[1] - NorthAmerica_bb_coords[0], 
                      NorthAmerica_bb_coords[3] - NorthAmerica_bb_coords[2], 
                      fill = False, ec = 'brown')
rect2 = plt.Rectangle((Eurasia_bb_coords[0], Eurasia_bb_coords[2]), 
                      Eurasia_bb_coords[1] - Eurasia_bb_coords[0],
                      Eurasia_bb_coords[3] - Eurasia_bb_coords[2], 
                      fill = False, ec = 'blue')
rect3 = plt.Rectangle((Africa_bb_coords[0], Africa_bb_coords[2]), 
                      Africa_bb_coords[1] - Africa_bb_coords[0],
                      Africa_bb_coords[3] - Africa_bb_coords[2], 
                      fill = False, ec = 'darkgreen')
rect4 = plt.Rectangle((SouthAmerica_bb_coords[0], SouthAmerica_bb_coords[2]), 
                      SouthAmerica_bb_coords[1] - SouthAmerica_bb_coords[0], 
                      SouthAmerica_bb_coords[3] - SouthAmerica_bb_coords[2], 
                      fill = False, ec = 'red')
rect5 = plt.Rectangle((Australia_bb_coords[0], Australia_bb_coords[2]), 
                      Australia_bb_coords[1] - Australia_bb_coords[0],
                      Australia_bb_coords[3] - Australia_bb_coords[2], 
                      fill = False, ec = 'darkorange')
ax.add_patch(rect1)
ax.add_patch(rect2)
ax.add_patch(rect3)
ax.add_patch(rect4)
ax.add_patch(rect5)

X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)

plt.show()
(3023, 2)
(3023,)

Let us check how correlated pairwise distances between data points are after applying PCA, UMAP and tSNE to the syntetic World's Map data det. For this purpose we will use Spearman correlation between original distances and distances between data points after dimension reduction. We are going to apply bootstrapping, i.e. subsampling with replacement multiple times, for building confidence intervals for a more robust comparision of the dimension reduction algorithms.

In [12]:
import warnings
warnings.filterwarnings("ignore")

import random
from umap import UMAP
from sklearn.manifold import TSNE
from scipy.stats import spearmanr, pearsonr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

coef_pca_list = []; coef_tsne_list = []; coef_umap_list = []
for i in range(10):
    print('Working with iteration {}'.format(i + 1))
    X_boot = X[random.sample(range(X.shape[0]), int(round(X.shape[0] * 0.9, 0))), :]
    X_reduced = PCA(n_components = 2).fit_transform(X_boot)
    model_tsne = TSNE(learning_rate = 200, n_components = 2, perplexity = 500, 
                 init = X_reduced, n_iter = 1000, verbose = 0)
    tsne = model_tsne.fit_transform(X_boot)

    model_umap = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = 500, 
                 init = X_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model_umap.fit_transform(X_boot)

    dist_orig = np.square(euclidean_distances(X_boot, X_boot)).flatten()
    dist_pca = np.square(euclidean_distances(X_reduced, X_reduced)).flatten()
    dist_tsne = np.square(euclidean_distances(tsne, tsne)).flatten()
    dist_umap = np.square(euclidean_distances(umap, umap)).flatten()

    coef_pca, p_pca = spearmanr(dist_orig, dist_pca)
    coef_tsne, p_tsne = spearmanr(dist_orig, dist_tsne)
    coef_umap, p_umap = spearmanr(dist_orig, dist_umap)
    coef_pca_list.append(coef_pca); coef_tsne_list.append(coef_tsne); coef_umap_list.append(coef_umap)
    print('Spearman correlation coeffcient for PCA dimension reduction = {}'.format(coef_pca))
    print('Spearman correlation coeffcient for tSNE dimension reduction = {}'.format(coef_tsne))
    print('Spearman correlation coeffcient for UMAP dimension reduction = {}'.format(coef_umap))
    print('****************************************************************************')
Working with iteration 1
Spearman correlation coeffcient for PCA dimension reduction = 0.999999999999995
Spearman correlation coeffcient for tSNE dimension reduction = 0.937240521872445
Spearman correlation coeffcient for UMAP dimension reduction = 0.9532903814214244
****************************************************************************
Working with iteration 2
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999982
Spearman correlation coeffcient for tSNE dimension reduction = 0.9402015577883607
Spearman correlation coeffcient for UMAP dimension reduction = 0.9556804540616886
****************************************************************************
Working with iteration 3
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999998
Spearman correlation coeffcient for tSNE dimension reduction = 0.9406335316188061
Spearman correlation coeffcient for UMAP dimension reduction = 0.9327735354321872
****************************************************************************
Working with iteration 4
Spearman correlation coeffcient for PCA dimension reduction = 0.999999999999995
Spearman correlation coeffcient for tSNE dimension reduction = 0.941842289295808
Spearman correlation coeffcient for UMAP dimension reduction = 0.9583347336463377
****************************************************************************
Working with iteration 5
Spearman correlation coeffcient for PCA dimension reduction = 1.0
Spearman correlation coeffcient for tSNE dimension reduction = 0.9416493610159327
Spearman correlation coeffcient for UMAP dimension reduction = 0.9575100286300631
****************************************************************************
Working with iteration 6
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999915
Spearman correlation coeffcient for tSNE dimension reduction = 0.9423348040237298
Spearman correlation coeffcient for UMAP dimension reduction = 0.9440038061963655
****************************************************************************
Working with iteration 7
Spearman correlation coeffcient for PCA dimension reduction = 1.0
Spearman correlation coeffcient for tSNE dimension reduction = 0.9404965855315344
Spearman correlation coeffcient for UMAP dimension reduction = 0.9552294818657279
****************************************************************************
Working with iteration 8
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999896
Spearman correlation coeffcient for tSNE dimension reduction = 0.9419682973543229
Spearman correlation coeffcient for UMAP dimension reduction = 0.9603953057246984
****************************************************************************
Working with iteration 9
Spearman correlation coeffcient for PCA dimension reduction = 0.999999999999995
Spearman correlation coeffcient for tSNE dimension reduction = 0.9390097057141956
Spearman correlation coeffcient for UMAP dimension reduction = 0.9481289542255661
****************************************************************************
Working with iteration 10
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999932
Spearman correlation coeffcient for tSNE dimension reduction = 0.9408607949673812
Spearman correlation coeffcient for UMAP dimension reduction = 0.9663658439374344
****************************************************************************
In [42]:
import matplotlib
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
plt.boxplot([coef_pca_list, coef_umap_list, coef_tsne_list], labels = ['PCA', 'UMAP', 'tSNE'], patch_artist = True)
plt.ylabel('Spearman Correlation Coefficient', fontsize = 22)
plt.title('Correlation of original with reconstructed distances between data points', fontsize = 22)
plt.show()
In [39]:
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(coef_umap_list, coef_tsne_list)
print('Statistics = %.3f, p = %.3f' % (stat, p))
Statistics = 10.000, p = 0.001

We conclude that PCA perfectly reconstructs the original 2D World's Map data, which is not surprising because the World's Map is so far a linear manifold. So the Spearman correlation coefficient between the original and reconstructed distances is close to one. Both UMAP and tSNE preserve the majority of pairwise distances with the Spearman correlation coefficient > 0.9. However, as we can see from both the boxplot and Mann-Whittney U test, the Spearman correlation coefficient for UMAP is significantly higher than the one for tSNE, implying UMAP is superior in global structure preservation on a linear manifold even if one fixes the initialization to be PCA for both UMAP and tSNE removing thus this layer of uncertanty for comparison. Computing correlations betwen all pairs of points within and between clusters we can see that UMAP better preserves both local and global structure.

Let us now check how the dimension reduction techniques can preserve the distances between centroids of the clusters. This time we are not going to apply bootstrapping but use the whole data set in order to demonstrate an interesting effect about tSNE vs. UMAP which is the absense of stochasticity for tSNE and its presence for UMAP.

In [33]:
import warnings
warnings.filterwarnings("ignore")

import random
from umap import UMAP
from sklearn.manifold import TSNE
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

coef_pca_centroids_list = []; coef_tsne_centroids_list = []; coef_umap_centroids_list = []
for i in range(10):
    print('Working with iteration {}'.format(i + 1))
    #X_boot = X[random.sample(range(X.shape[0]), int(round(X.shape[0] * 0.8, 0))), :]
    X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
    X_centroids = np.vstack((np.mean(X_NorthAmerica, axis = 0), 
                             np.mean(X_SouthAmerica, axis = 0), 
                             np.mean(X_Australia, axis = 0), 
                             np.mean(X_Eurasia, axis = 0), 
                             np.mean(X_Africa, axis = 0)))
    X_reduced = PCA(n_components = 2).fit_transform(X)
    X_pca_centroids = np.vstack((np.mean(X_reduced[y == 'brown'], axis = 0), 
                                 np.mean(X_reduced[y == 'red'], axis = 0), 
                                 np.mean(X_reduced[y == 'darkorange'], axis = 0), 
                                 np.mean(X_reduced[y == 'blue'], axis = 0), 
                                 np.mean(X_reduced[y == 'darkgreen'], axis = 0)))
    model_tsne = TSNE(learning_rate = 200, n_components = 2, perplexity = 500, 
                      init = X_reduced, n_iter = 1000, verbose = 0)
    tsne = model_tsne.fit_transform(X)
    X_tsne_centroids = np.vstack((np.mean(tsne[y == 'brown'], axis = 0), 
                                  np.mean(tsne[y == 'red'], axis = 0), 
                                  np.mean(tsne[y == 'darkorange'], axis = 0), 
                                  np.mean(tsne[y == 'blue'], axis = 0), 
                                  np.mean(tsne[y == 'darkgreen'], axis = 0)))
    model_umap = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = 500, 
                      init = X_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model_umap.fit_transform(X)
    X_umap_centroids = np.vstack((np.mean(umap[y == 'brown'], axis = 0), 
                                  np.mean(umap[y == 'red'], axis = 0), 
                                  np.mean(umap[y == 'darkorange'], axis = 0), 
                                  np.mean(umap[y == 'blue'], axis = 0), 
                                  np.mean(umap[y == 'darkgreen'], axis = 0)))
    
    #from sklearn.metrics.pairwise import pairwise_distances
    #np.square(pairwise_distances(X_centroids, X_centroids, metric = 'mahalanobis'))
    
    dist_centroids_orig = np.square(euclidean_distances(X_centroids, X_centroids)).flatten()
    dist_centroids_pca = np.square(euclidean_distances(X_pca_centroids, X_pca_centroids)).flatten()
    dist_centroids_tsne = np.square(euclidean_distances(X_tsne_centroids, X_tsne_centroids)).flatten()
    dist_centroids_umap = np.square(euclidean_distances(X_umap_centroids, X_umap_centroids)).flatten()
    
    coef_centroids_pca, p_centroids_pca = spearmanr(dist_centroids_orig, dist_centroids_pca)
    coef_centroids_tsne, p_centroids_tsne = spearmanr(dist_centroids_orig, dist_centroids_tsne)
    coef_centroids_umap, p_centroids_umap = spearmanr(dist_centroids_orig, dist_centroids_umap)
    
    coef_pca_centroids_list.append(coef_centroids_pca); coef_tsne_centroids_list.append(coef_centroids_tsne); 
    coef_umap_centroids_list.append(coef_centroids_umap)
    print('Spearman correlation coeffcient for PCA dimension reduction = {}'.format(coef_centroids_pca))
    print('Spearman correlation coeffcient for tSNE dimension reduction = {}'.format(coef_centroids_tsne))
    print('Spearman correlation coeffcient for UMAP dimension reduction = {}'.format(coef_centroids_umap))
    print('****************************************************************************')
Working with iteration 1
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.91284046692607
****************************************************************************
Working with iteration 2
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.9252918287937743
****************************************************************************
Working with iteration 3
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.8754863813229572
****************************************************************************
Working with iteration 4
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.8630350194552528
****************************************************************************
Working with iteration 5
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.9315175097276265
****************************************************************************
Working with iteration 6
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.9003891050583657
****************************************************************************
Working with iteration 7
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.9315175097276265
****************************************************************************
Working with iteration 8
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.9252918287937743
****************************************************************************
Working with iteration 9
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.9252918287937743
****************************************************************************
Working with iteration 10
Spearman correlation coeffcient for PCA dimension reduction = 0.9996111218985752
Spearman correlation coeffcient for tSNE dimension reduction = 0.9309742979933798
Spearman correlation coeffcient for UMAP dimension reduction = 0.91284046692607
****************************************************************************
In [34]:
import matplotlib
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
plt.boxplot([coef_pca_centroids_list, coef_umap_centroids_list, coef_tsne_centroids_list], 
            labels = ['PCA', 'UMAP', 'tSNE'], patch_artist = True)
plt.title('Preservation of distances between centroids of clusters by dimension reduction algorithms', 
          fontsize = 22)
plt.ylabel('Spearman Correlation coefficient', fontsize = 24)
plt.show()

We conclude that we do not observe a significant different in distance preservation between centroids of clusters by UMAP and tSNE, PCA of course perfectly preserved the distances between centroids. Here we observe a very interesting lack of stochasticity for tSNE, i.e. the result becomes deterministic (the same from run to run) when initialization is not random. In contrast, initialization is not the only source of stocasticity for UMAP but it also comes from the stocastic gradient descent. Here we see that for a non-random initialization, the result of UMAP still varies from run to run because of stochastic gradient descent.

Measuring distances between centroids, i.e. ignoring the variation in the data, is perhaps not a very good idea since the clusters are elogated. Therefore a better idea is to compute Mahalanobis distances between all pairs of clusters. Mahalanobis distance first calculates distances between each point of one cluster to the centroid of the second cluster and normalizes those distances by the "thickness" of the variation (assuming the clustesr have ellipsoidal symmetry) in the second cluster.

In [184]:
import matplotlib
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import pairwise_distances

figure = plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})

clusters = list(set(y))
clust_dict = {'brown': 'North America', 'red': 'South America', 'blue': 'Eurasia', 
              'darkgreen': 'Africa', 'darkorange': 'Australia'}

coef_tsne_all = []; coef_umap_all = []
for clust_index, j in enumerate(clusters):
    coef_tsne = []; coef_umap = []
    for i in clusters:
        if(i!=j):
            orig_dist = np.square(pairwise_distances(X[y == j], X[y == i], metric = 'mahalanobis')).flatten()
            tsne_dist = np.square(pairwise_distances(tsne[y == j], 
                                                     tsne[y == i], metric = 'mahalanobis')).flatten()
            umap_dist = np.square(pairwise_distances(umap[y == j], 
                                                     umap[y == i], metric = 'mahalanobis')).flatten()
            coef_tsne_current, _ = spearmanr(orig_dist, tsne_dist)
            coef_umap_current, _ = spearmanr(orig_dist, umap_dist)
            coef_tsne.append(np.abs(coef_tsne_current))
            coef_umap.append(np.abs(coef_umap_current))
    
    plt.subplot(321 + clust_index)
    plt.boxplot([coef_tsne, coef_umap], labels = ['tSNE', 'UMAP'], patch_artist = True)
    plt.title('Distances Between {} and Others'.format(clust_dict[j]), fontsize = 22)
    plt.ylabel('Spearman Rho', fontsize = 24)
    
    coef_tsne_all.append(coef_tsne)
    coef_umap_all.append(coef_umap)

coef_tsne_all = [y for x in coef_tsne_all for y in x]
coef_umap_all = [y for x in coef_umap_all for y in x]

plt.subplot(326)
plt.boxplot([coef_tsne_all, coef_umap_all], labels = ['tSNE', 'UMAP'], patch_artist = True)
plt.title('Distances Between All Pairs of Clusters', fontsize = 22)
plt.ylabel('Spearman Rho', fontsize = 24)

figure.tight_layout()
plt.show()
In [185]:
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(coef_tsne_all, coef_umap_all)
print('Statistics = %.3f, p = %.3f' % (stat, p))
Statistics = 132.000, p = 0.034

Here we observe that while being somewhat unclear for South America, Australia and Africa, at least for North America and Eurasia, UMAP preserves the original mahalanobis distance between those clustesr and the other ones much better than tSNE. Averaging across all clusters (the last plot in the figure above) and performing a Mann-Whittney U test, we demonstrate that indeed UMAP significantly better preserves original Mahalanobis distances between the continets / clusters.

Let us now estimate how well the shape of the clusters is preserved by the different dimension reduction methods. By shape of the clusters we understand the sizes (heigh and width) of the bounding box wrapped around the clusters. The scale of the bounding boxes is changed a lot during tSNE and UMAP dimension reduction, i.e. it can be stratched or squeezed, however both width and height should change proportionally towards each other (keeping its scaling ratio) and towards the initial dimension of the bounding box. Therefore we are going to use Spearman correlation coefficient between original and reconstructed sizes of the bounding boxes as a criterion of how well the dimension reduction algorithms can preserve the shapes of the clusters.

In [226]:
import warnings
warnings.filterwarnings("ignore")

import random
from umap import UMAP
from sklearn.manifold import TSNE
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

coef_sizes_pca_list = []; coef_sizes_tsne_list = []; coef_sizes_umap_list = []
for i in range(10):
    print('Working with iteration {}'.format(i + 1))
    
    NorthAmerica_bb_coords = [np.min(X_NorthAmerica[:,0]), np.max(X_NorthAmerica[:,0]), 
                              np.min(X_NorthAmerica[:,1]), np.max(X_NorthAmerica[:,1])]
    NorthAmerica_bb_sizes = np.array([NorthAmerica_bb_coords[1] - NorthAmerica_bb_coords[0], 
                                      NorthAmerica_bb_coords[3] - NorthAmerica_bb_coords[2]])
    SouthAmerica_bb_coords = [np.min(X_SouthAmerica[:,0]), np.max(X_SouthAmerica[:,0]), 
                              np.min(X_SouthAmerica[:,1]), np.max(X_SouthAmerica[:,1])]
    SouthAmerica_bb_sizes = np.array([SouthAmerica_bb_coords[1] - SouthAmerica_bb_coords[0], 
                                      SouthAmerica_bb_coords[3] - SouthAmerica_bb_coords[2]])
    Australia_bb_coords = [np.min(X_Australia[:,0]), np.max(X_Australia[:,0]), 
                           np.min(X_Australia[:,1]), np.max(X_Australia[:,1])]
    Australia_bb_sizes = np.array([Australia_bb_coords[1] - Australia_bb_coords[0], 
                                   Australia_bb_coords[3] - Australia_bb_coords[2]])
    Eurasia_bb_coords = [np.min(X_Eurasia[:,0]), np.max(X_Eurasia[:,0]), 
                         np.min(X_Eurasia[:,1]), np.max(X_Eurasia[:,1])]
    Eurasia_bb_sizes = np.array([Eurasia_bb_coords[1] - Eurasia_bb_coords[0], 
                                 Eurasia_bb_coords[3] - Eurasia_bb_coords[2]])
    Africa_bb_coords = [np.min(X_Africa[:,0]), np.max(X_Africa[:,0]), 
                        np.min(X_Africa[:,1]), np.max(X_Africa[:,1])]
    Africa_bb_sizes = np.array([Africa_bb_coords[1] - Africa_bb_coords[0], 
                                Africa_bb_coords[3] - Africa_bb_coords[2]])
    orig_bb_sizes = np.vstack((NorthAmerica_bb_sizes, SouthAmerica_bb_sizes, Australia_bb_sizes, 
                               Eurasia_bb_sizes, Africa_bb_sizes)).flatten()
    
    
    X_reduced = PCA(n_components = 2).fit_transform(X)
    NorthAmerica_pca_bb_coords = [np.min(X_reduced[y == 'brown'][:,0]), np.max(X_reduced[y == 'brown'][:,0]), 
                                  np.min(X_reduced[y == 'brown'][:,1]), np.max(X_reduced[y == 'brown'][:,1])]
    NorthAmerica_pca_bb_sizes = np.array([NorthAmerica_pca_bb_coords[1] - NorthAmerica_pca_bb_coords[0], 
                                          NorthAmerica_pca_bb_coords[3] - NorthAmerica_pca_bb_coords[2]])
    SouthAmerica_pca_bb_coords = [np.min(X_reduced[y == 'red'][:,0]), np.max(X_reduced[y == 'red'][:,0]), 
                                  np.min(X_reduced[y == 'red'][:,1]), np.max(X_reduced[y == 'red'][:,1])]
    SouthAmerica_pca_bb_sizes = np.array([SouthAmerica_pca_bb_coords[1] - SouthAmerica_pca_bb_coords[0], 
                                          SouthAmerica_pca_bb_coords[3] - SouthAmerica_pca_bb_coords[2]])
    Australia_pca_bb_coords = [np.min(X_reduced[y == 'darkorange'][:,0]), 
                               np.max(X_reduced[y == 'darkorange'][:,0]), 
                               np.min(X_reduced[y == 'darkorange'][:,1]), 
                               np.max(X_reduced[y == 'darkorange'][:,1])]
    Australia_pca_bb_sizes = np.array([Australia_pca_bb_coords[1] - Australia_pca_bb_coords[0], 
                                       Australia_pca_bb_coords[3] - Australia_pca_bb_coords[2]])
    Eurasia_pca_bb_coords = [np.min(X_reduced[y == 'blue'][:,0]), np.max(X_reduced[y == 'blue'][:,0]), 
                             np.min(X_reduced[y == 'blue'][:,1]), np.max(X_reduced[y == 'blue'][:,1])]
    Eurasia_pca_bb_sizes = np.array([Eurasia_pca_bb_coords[1] - Eurasia_pca_bb_coords[0], 
                                     Eurasia_pca_bb_coords[3] - Eurasia_pca_bb_coords[2]])
    Africa_pca_bb_coords = [np.min(X_reduced[y == 'darkgreen'][:,0]), np.max(X_reduced[y == 'darkgreen'][:,0]), 
                            np.min(X_reduced[y == 'darkgreen'][:,1]), np.max(X_reduced[y == 'darkgreen'][:,1])]
    Africa_pca_bb_sizes = np.array([Africa_pca_bb_coords[1] - Africa_pca_bb_coords[0], 
                                    Africa_pca_bb_coords[3] - Africa_pca_bb_coords[2]])
    pca_bb_sizes = np.vstack((NorthAmerica_pca_bb_sizes, SouthAmerica_pca_bb_sizes, 
                              Australia_pca_bb_sizes, Eurasia_pca_bb_sizes, Africa_pca_bb_sizes)).flatten()
    
    
    model_tsne = TSNE(learning_rate = 200, n_components = 2, perplexity = 500, 
                      init = X_reduced, n_iter = 1000, verbose = 0)
    tsne = model_tsne.fit_transform(X)
    NorthAmerica_tsne_bb_coords = [np.min(tsne[y == 'brown'][:,0]), np.max(tsne[y == 'brown'][:,0]), 
                                   np.min(tsne[y == 'brown'][:,1]), np.max(tsne[y == 'brown'][:,1])]
    NorthAmerica_tsne_bb_sizes = np.array([NorthAmerica_tsne_bb_coords[1] - NorthAmerica_tsne_bb_coords[0], 
                                           NorthAmerica_tsne_bb_coords[3] - NorthAmerica_tsne_bb_coords[2]])
    SouthAmerica_tsne_bb_coords = [np.min(tsne[y == 'red'][:,0]), np.max(tsne[y == 'red'][:,0]), 
                                   np.min(tsne[y == 'red'][:,1]), np.max(tsne[y == 'red'][:,1])]
    SouthAmerica_tsne_bb_sizes = np.array([SouthAmerica_tsne_bb_coords[1] - SouthAmerica_tsne_bb_coords[0], 
                                           SouthAmerica_tsne_bb_coords[3] - SouthAmerica_tsne_bb_coords[2]])
    Australia_tsne_bb_coords = [np.min(tsne[y == 'darkorange'][:,0]), np.max(tsne[y == 'darkorange'][:,0]), 
                                np.min(tsne[y == 'darkorange'][:,1]), np.max(tsne[y == 'darkorange'][:,1])]
    Australia_tsne_bb_sizes = np.array([Australia_tsne_bb_coords[1] - Australia_tsne_bb_coords[0], 
                                        Australia_tsne_bb_coords[3] - Australia_tsne_bb_coords[2]])
    Eurasia_tsne_bb_coords = [np.min(tsne[y == 'blue'][:,0]), np.max(tsne[y == 'blue'][:,0]), 
                              np.min(tsne[y == 'blue'][:,1]), np.max(tsne[y == 'blue'][:,1])]
    Eurasia_tsne_bb_sizes = np.array([Eurasia_tsne_bb_coords[1] - Eurasia_tsne_bb_coords[0], 
                                      Eurasia_tsne_bb_coords[3] - Eurasia_tsne_bb_coords[2]])
    Africa_tsne_bb_coords = [np.min(tsne[y == 'darkgreen'][:,0]), np.max(tsne[y == 'darkgreen'][:,0]), 
                             np.min(tsne[y == 'darkgreen'][:,1]), np.max(tsne[y == 'darkgreen'][:,1])]
    Africa_tsne_bb_sizes = np.array([Africa_tsne_bb_coords[1] - Africa_tsne_bb_coords[0], 
                                     Africa_tsne_bb_coords[3] - Africa_tsne_bb_coords[2]])
    tsne_bb_sizes = np.vstack((NorthAmerica_tsne_bb_sizes, SouthAmerica_tsne_bb_sizes, 
                               Australia_tsne_bb_sizes, Eurasia_tsne_bb_sizes, Africa_tsne_bb_sizes)).flatten()
    
    
    model_umap = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = 500, 
                      init = X_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model_umap.fit_transform(X)
    NorthAmerica_umap_bb_coords = [np.min(umap[y == 'brown'][:,0]), np.max(umap[y == 'brown'][:,0]), 
                                   np.min(umap[y == 'brown'][:,1]), np.max(umap[y == 'brown'][:,1])]
    NorthAmerica_umap_bb_sizes = np.array([NorthAmerica_umap_bb_coords[1] - NorthAmerica_umap_bb_coords[0], 
                                           NorthAmerica_umap_bb_coords[3] - NorthAmerica_umap_bb_coords[2]])
    SouthAmerica_umap_bb_coords = [np.min(umap[y == 'red'][:,0]), np.max(umap[y == 'red'][:,0]), 
                                   np.min(umap[y == 'red'][:,1]), np.max(umap[y == 'red'][:,1])]
    SouthAmerica_umap_bb_sizes = np.array([SouthAmerica_umap_bb_coords[1] - SouthAmerica_umap_bb_coords[0], 
                                           SouthAmerica_umap_bb_coords[3] - SouthAmerica_umap_bb_coords[2]])
    Australia_umap_bb_coords = [np.min(umap[y == 'darkorange'][:,0]), np.max(umap[y == 'darkorange'][:,0]), 
                                np.min(umap[y == 'darkorange'][:,1]), np.max(umap[y == 'darkorange'][:,1])]
    Australia_umap_bb_sizes = np.array([Australia_umap_bb_coords[1] - Australia_umap_bb_coords[0], 
                                        Australia_umap_bb_coords[3] - Australia_umap_bb_coords[2]])
    Eurasia_umap_bb_coords = [np.min(umap[y == 'blue'][:,0]), np.max(umap[y == 'blue'][:,0]), 
                              np.min(umap[y == 'blue'][:,1]), np.max(umap[y == 'blue'][:,1])]
    Eurasia_umap_bb_sizes = np.array([Eurasia_umap_bb_coords[1] - Eurasia_umap_bb_coords[0], 
                                      Eurasia_umap_bb_coords[3] - Eurasia_umap_bb_coords[2]])
    Africa_umap_bb_coords = [np.min(umap[y == 'darkgreen'][:,0]), np.max(umap[y == 'darkgreen'][:,0]), 
                             np.min(umap[y == 'darkgreen'][:,1]), np.max(umap[y == 'darkgreen'][:,1])]
    Africa_umap_bb_sizes = np.array([Africa_umap_bb_coords[1] - Africa_umap_bb_coords[0], 
                                     Africa_umap_bb_coords[3] - Africa_umap_bb_coords[2]])
    umap_bb_sizes = np.vstack((NorthAmerica_umap_bb_sizes, SouthAmerica_umap_bb_sizes, 
                               Australia_umap_bb_sizes, Eurasia_umap_bb_sizes, Africa_umap_bb_sizes)).flatten()
    
    
    coef_sizes_pca, p_sizes_pca = spearmanr(orig_bb_sizes, pca_bb_sizes)
    coef_sizes_tsne, p_sizes_tsne = spearmanr(orig_bb_sizes, tsne_bb_sizes)
    coef_sizes_umap, p_sizes_umap = spearmanr(orig_bb_sizes, umap_bb_sizes)
    
    coef_sizes_pca_list.append(coef_sizes_pca)
    coef_sizes_tsne_list.append(coef_sizes_tsne)
    coef_sizes_umap_list.append(coef_sizes_umap)
    
    figure = plt.figure(figsize = (20, 6))
    plt.subplot(131)
    plt.scatter(orig_bb_sizes, pca_bb_sizes)
    plt.xlabel('Original Sizes'); plt.ylabel('Reconstructed Sizes'); plt.title('PCA')
    plt.subplot(132)
    plt.scatter(orig_bb_sizes, tsne_bb_sizes)
    plt.xlabel('Original Sizes'); plt.ylabel('Reconstructed Sizes'); plt.title('tSNE')
    plt.subplot(133)
    plt.scatter(orig_bb_sizes, umap_bb_sizes)
    plt.xlabel('Original Sizes'); plt.ylabel('Reconstructed Sizes'); plt.title('UMAP')
    figure.tight_layout()
    plt.show()
    
    print('Spearman correlation coeffcient for PCA dimension reduction = {}'.format(coef_sizes_pca))
    print('Spearman correlation coeffcient for tSNE dimension reduction = {}'.format(coef_sizes_tsne))
    print('Spearman correlation coeffcient for UMAP dimension reduction = {}'.format(coef_sizes_umap))
    print('****************************************************************************')
Working with iteration 1
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.8424242424242423
****************************************************************************
Working with iteration 2
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.9272727272727272
****************************************************************************
Working with iteration 3
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.8303030303030302
****************************************************************************
Working with iteration 4
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.8181818181818182
****************************************************************************
Working with iteration 5
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.9878787878787878
****************************************************************************
Working with iteration 6
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.9393939393939393
****************************************************************************
Working with iteration 7
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.7939393939393938
****************************************************************************
Working with iteration 8
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.8909090909090909
****************************************************************************
Working with iteration 9
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.9272727272727272
****************************************************************************
Working with iteration 10
Spearman correlation coeffcient for PCA dimension reduction = 0.9999999999999999
Spearman correlation coeffcient for tSNE dimension reduction = 0.8181818181818182
Spearman correlation coeffcient for UMAP dimension reduction = 0.7454545454545454
****************************************************************************
In [227]:
import matplotlib
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
plt.boxplot([coef_sizes_pca_list, coef_sizes_umap_list, coef_sizes_tsne_list], 
            labels = ['PCA', 'UMAP', 'tSNE'], patch_artist = True)
plt.title('Preserving shapes of clusters by dimension reduction algorithms', fontsize = 22)
plt.ylabel('Spearman Correlation coefficient', fontsize = 24)
plt.show()
In [228]:
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(coef_sizes_umap_list, coef_sizes_tsne_list)
print('Statistics = %.3f, p = %.3f' % (stat, p))
Statistics = 25.000, p = 0.021

PCA of course demonstrates a perfect correlation between original and reconstructed sizes of the bounding boxes around the clusters. Regarding tSNE vs. UMAP, despite large variation in the UMAP Spearman correlation coefficient between original and reconstructed cluster sizes, we conclude that UMAP significantly better preserves the shapes / sizes of the clusters which is another confirmation of better global structure preservation by UMAP compared to tSNE.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Linear Manifold Without Noise

First let us construct a sintetic data set. We will use the fantastic In-Depth Manifold Learning tutorial and start with a word projected to a linear manifold without added noise, later we are going to project the word onto a non-linear manifold such as S-curve or swiss roll.

To construct the sintetic data set we will use the fantastic In-Depth Manifold Learning tutorial and start with a 2D collection of points.

In [155]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

N = 10000
fig, ax = plt.subplots(figsize=(10, 1))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
ax.axis('off')
ax.text(0.5, 0.4, 'H E L L O', va = 'center', ha = 'center', weight = 'bold', size = 85)
fig.savefig('hello.png')
plt.close(fig)
    
data = imread('hello.png')[::-1, :, 0].T
rng = np.random.RandomState(123)
X = rng.rand(4 * N, 2)
i, j = (X * data.shape).astype(int).T
mask = (data[i, j] < 1)
X = X[mask]
X[:, 0] *= (data.shape[0] / data.shape[1])
X = X[:N]
X = X[np.argsort(X[:, 0])]
plt.figure(figsize=(20,15))
plt.scatter(X[:, 0], X[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.axis('equal');
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

def make_hello(N=1000, rseed=42):
    # Make a plot with "HELLO" text; save as PNG
    fig, ax = plt.subplots(figsize=(4, 1))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
    ax.axis('off')
    ax.text(0.5, 0.4, 'HELLO', va='center', ha='center', weight='bold', size=85)
    fig.savefig('hello.png')
    plt.close(fig)
    
    # Open this PNG and draw random points from it
    from matplotlib.image import imread
    data = imread('hello.png')[::-1, :, 0].T
    rng = np.random.RandomState(rseed)
    X = rng.rand(4 * N, 2)
    i, j = (X * data.shape).astype(int).T
    mask = (data[i, j] < 1)
    X = X[mask]
    X[:, 0] *= (data.shape[0] / data.shape[1])
    X = X[:N]
    return X[np.argsort(X[:, 0])]
In [2]:
X = make_hello(1000)
plt.figure(figsize=(20,15))
plt.scatter(X[:, 0], X[:, 1], c=X[:, 0], cmap=plt.cm.get_cmap('rainbow', 5), s = 50)
plt.axis('equal');

This is how the synetic data looks like, it is just a 2D numpy array containing 1000 data points.

In [3]:
X
Out[3]:
array([[4.65390215e-05, 4.16565828e-01],
       [5.38772018e-04, 5.11129139e-01],
       [2.61356305e-03, 8.70669034e-01],
       ...,
       [3.99099756e+00, 4.51739476e-01],
       [3.99173644e+00, 3.50711815e-01],
       [3.99557045e+00, 3.26639249e-01]])

Now let us perform the linear dimension reduction with the Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS) and see whether PCA and MDS will able to reconstruct the initial data set via the linear matrix decomposition.

In [4]:
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 20)
plt.xlabel("PCA1", fontsize = 20)
plt.ylabel("PCA2", fontsize = 20)
plt.show()
In [5]:
from sklearn.manifold import MDS
model_mds = MDS(n_components = 2, random_state = 123, metric = True)
mmds = model_mds.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(mmds[:, 0], mmds[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('Metric Multi-Dimensional Scaling (MMDS)', fontsize = 20)
plt.xlabel("MMDS1", fontsize = 20)
plt.ylabel("MMDS2", fontsize = 20)
plt.show()

We conclude that the PCA and MDS perfectly reconstructed the original data, this is not surprising because it is a 2D data and a linear manifold, i.e. only rotations, scaling and translations. For comparision we will check the Laplacian Eigenmaps dimension reduction plot.

In [6]:
from sklearn.manifold import SpectralEmbedding
model = SpectralEmbedding(n_components = 2)
se = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(se[:, 0], se[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('Laplacian Eigenmap', fontsize = 20)
plt.xlabel("LAP1", fontsize = 20)
plt.ylabel("LAP2", fontsize = 20)
plt.show()

The Laplacian eigenmaps tend to group points from each cluster / letter together into a single point which explains why Spectral Clustering is such a powerful and popular technique. Since the Laplacian Eigenmaps basically produces very tighthly packed clusters, it is very easy to run any clustering algorithm, even K-means, on the top of the dimensions reduction.

Now it is time to run a non-linear dimension reduction such as tSNE. Here we will use a very large perplexity value, $\sqrt{N} \approx 30$ should be enough for a good balance between local and global structure. We will start with moderate perplexity = 30 and later we will use perplexity = 1000 in order to aim to fully reconstruct the origianl data set.

In [108]:
from sklearn.manifold import TSNE
X_reduced = PCA(n_components = 2).fit_transform(X)
model = TSNE(learning_rate = 1, n_components = 2, random_state = 123, perplexity = 30, 
             init = X_reduced, n_iter = 10000, verbose = 2, early_exaggeration = 1)
tsne = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('tSNE', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1000 samples in 0.002s...
[t-SNE] Computed neighbors for 1000 samples in 0.012s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1000
[t-SNE] Mean sigma: 0.102819
[t-SNE] Computed conditional probabilities in 0.055s
[t-SNE] Iteration 50: error = 2.3763542, gradient norm = 0.0218554 (50 iterations in 0.267s)
[t-SNE] Iteration 100: error = 1.8734579, gradient norm = 0.0148718 (50 iterations in 0.208s)
[t-SNE] Iteration 150: error = 1.4898815, gradient norm = 0.0100922 (50 iterations in 0.197s)
[t-SNE] Iteration 200: error = 1.2309555, gradient norm = 0.0071857 (50 iterations in 0.198s)
[t-SNE] Iteration 250: error = 1.0531418, gradient norm = 0.0053980 (50 iterations in 0.175s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 1.053142
[t-SNE] Iteration 300: error = 1.0176998, gradient norm = 0.0050624 (50 iterations in 0.187s)
[t-SNE] Iteration 350: error = 0.9356309, gradient norm = 0.0043211 (50 iterations in 0.172s)
[t-SNE] Iteration 400: error = 0.8396809, gradient norm = 0.0035091 (50 iterations in 0.178s)
[t-SNE] Iteration 450: error = 0.7516401, gradient norm = 0.0028227 (50 iterations in 0.180s)
[t-SNE] Iteration 500: error = 0.6784452, gradient norm = 0.0022709 (50 iterations in 0.185s)
[t-SNE] Iteration 550: error = 0.6191748, gradient norm = 0.0018671 (50 iterations in 0.176s)
[t-SNE] Iteration 600: error = 0.5711434, gradient norm = 0.0015661 (50 iterations in 0.187s)
[t-SNE] Iteration 650: error = 0.5317484, gradient norm = 0.0013260 (50 iterations in 0.177s)
[t-SNE] Iteration 700: error = 0.4996980, gradient norm = 0.0011303 (50 iterations in 0.198s)
[t-SNE] Iteration 750: error = 0.4730405, gradient norm = 0.0009810 (50 iterations in 0.194s)
[t-SNE] Iteration 800: error = 0.4507212, gradient norm = 0.0008631 (50 iterations in 0.200s)
[t-SNE] Iteration 850: error = 0.4316178, gradient norm = 0.0007750 (50 iterations in 0.203s)
[t-SNE] Iteration 900: error = 0.4153228, gradient norm = 0.0006932 (50 iterations in 0.203s)
[t-SNE] Iteration 950: error = 0.4009400, gradient norm = 0.0006239 (50 iterations in 0.200s)
[t-SNE] Iteration 1000: error = 0.3884377, gradient norm = 0.0005716 (50 iterations in 0.200s)
[t-SNE] Iteration 1050: error = 0.3771432, gradient norm = 0.0005232 (50 iterations in 0.196s)
[t-SNE] Iteration 1100: error = 0.3671872, gradient norm = 0.0004870 (50 iterations in 0.200s)
[t-SNE] Iteration 1150: error = 0.3581590, gradient norm = 0.0004499 (50 iterations in 0.186s)
[t-SNE] Iteration 1200: error = 0.3500463, gradient norm = 0.0004082 (50 iterations in 0.232s)
[t-SNE] Iteration 1250: error = 0.3429423, gradient norm = 0.0003872 (50 iterations in 0.189s)
[t-SNE] Iteration 1300: error = 0.3365152, gradient norm = 0.0003580 (50 iterations in 0.196s)
[t-SNE] Iteration 1350: error = 0.3306703, gradient norm = 0.0003313 (50 iterations in 0.191s)
[t-SNE] Iteration 1400: error = 0.3253272, gradient norm = 0.0003075 (50 iterations in 0.189s)
[t-SNE] Iteration 1450: error = 0.3205184, gradient norm = 0.0002836 (50 iterations in 0.185s)
[t-SNE] Iteration 1500: error = 0.3162664, gradient norm = 0.0002699 (50 iterations in 0.199s)
[t-SNE] Iteration 1550: error = 0.3123856, gradient norm = 0.0002592 (50 iterations in 0.188s)
[t-SNE] Iteration 1600: error = 0.3087282, gradient norm = 0.0002434 (50 iterations in 0.201s)
[t-SNE] Iteration 1650: error = 0.3053071, gradient norm = 0.0002363 (50 iterations in 0.187s)
[t-SNE] Iteration 1700: error = 0.3021984, gradient norm = 0.0002256 (50 iterations in 0.200s)
[t-SNE] Iteration 1750: error = 0.2993464, gradient norm = 0.0002142 (50 iterations in 0.197s)
[t-SNE] Iteration 1800: error = 0.2967884, gradient norm = 0.0002091 (50 iterations in 0.199s)
[t-SNE] Iteration 1850: error = 0.2944146, gradient norm = 0.0001933 (50 iterations in 0.189s)
[t-SNE] Iteration 1900: error = 0.2923997, gradient norm = 0.0001894 (50 iterations in 0.203s)
[t-SNE] Iteration 1950: error = 0.2903039, gradient norm = 0.0001848 (50 iterations in 0.186s)
[t-SNE] Iteration 2000: error = 0.2883886, gradient norm = 0.0001762 (50 iterations in 0.206s)
[t-SNE] Iteration 2050: error = 0.2865587, gradient norm = 0.0001754 (50 iterations in 0.188s)
[t-SNE] Iteration 2100: error = 0.2849835, gradient norm = 0.0001675 (50 iterations in 0.202s)
[t-SNE] Iteration 2150: error = 0.2833104, gradient norm = 0.0001592 (50 iterations in 0.205s)
[t-SNE] Iteration 2200: error = 0.2818873, gradient norm = 0.0001567 (50 iterations in 0.209s)
[t-SNE] Iteration 2250: error = 0.2806260, gradient norm = 0.0001545 (50 iterations in 0.242s)
[t-SNE] Iteration 2300: error = 0.2793276, gradient norm = 0.0001463 (50 iterations in 0.220s)
[t-SNE] Iteration 2350: error = 0.2781130, gradient norm = 0.0001440 (50 iterations in 0.203s)
[t-SNE] Iteration 2400: error = 0.2770164, gradient norm = 0.0001411 (50 iterations in 0.197s)
[t-SNE] Iteration 2450: error = 0.2759888, gradient norm = 0.0001373 (50 iterations in 0.203s)
[t-SNE] Iteration 2500: error = 0.2747658, gradient norm = 0.0001382 (50 iterations in 0.202s)
[t-SNE] Iteration 2550: error = 0.2738340, gradient norm = 0.0001418 (50 iterations in 0.197s)
[t-SNE] Iteration 2600: error = 0.2728610, gradient norm = 0.0001327 (50 iterations in 0.201s)
[t-SNE] Iteration 2650: error = 0.2719152, gradient norm = 0.0001340 (50 iterations in 0.187s)
[t-SNE] Iteration 2700: error = 0.2709783, gradient norm = 0.0001331 (50 iterations in 0.191s)
[t-SNE] Iteration 2750: error = 0.2701202, gradient norm = 0.0001359 (50 iterations in 0.196s)
[t-SNE] Iteration 2800: error = 0.2693513, gradient norm = 0.0001344 (50 iterations in 0.193s)
[t-SNE] Iteration 2850: error = 0.2685887, gradient norm = 0.0001272 (50 iterations in 0.187s)
[t-SNE] Iteration 2900: error = 0.2678534, gradient norm = 0.0001260 (50 iterations in 0.199s)
[t-SNE] Iteration 2950: error = 0.2670863, gradient norm = 0.0001261 (50 iterations in 0.188s)
[t-SNE] Iteration 3000: error = 0.2663991, gradient norm = 0.0001255 (50 iterations in 0.201s)
[t-SNE] Iteration 3050: error = 0.2656491, gradient norm = 0.0001262 (50 iterations in 0.190s)
[t-SNE] Iteration 3100: error = 0.2649826, gradient norm = 0.0001217 (50 iterations in 0.200s)
[t-SNE] Iteration 3150: error = 0.2643894, gradient norm = 0.0001256 (50 iterations in 0.194s)
[t-SNE] Iteration 3200: error = 0.2637620, gradient norm = 0.0001238 (50 iterations in 0.198s)
[t-SNE] Iteration 3250: error = 0.2632249, gradient norm = 0.0001276 (50 iterations in 0.197s)
[t-SNE] Iteration 3300: error = 0.2627140, gradient norm = 0.0001190 (50 iterations in 0.198s)
[t-SNE] Iteration 3350: error = 0.2622620, gradient norm = 0.0001198 (50 iterations in 0.193s)
[t-SNE] Iteration 3400: error = 0.2617540, gradient norm = 0.0001160 (50 iterations in 0.202s)
[t-SNE] Iteration 3450: error = 0.2612059, gradient norm = 0.0001156 (50 iterations in 0.194s)
[t-SNE] Iteration 3500: error = 0.2606744, gradient norm = 0.0001168 (50 iterations in 0.203s)
[t-SNE] Iteration 3550: error = 0.2602693, gradient norm = 0.0001132 (50 iterations in 0.198s)
[t-SNE] Iteration 3600: error = 0.2596899, gradient norm = 0.0001179 (50 iterations in 0.211s)
[t-SNE] Iteration 3650: error = 0.2592912, gradient norm = 0.0001130 (50 iterations in 0.196s)
[t-SNE] Iteration 3700: error = 0.2588634, gradient norm = 0.0001072 (50 iterations in 0.203s)
[t-SNE] Iteration 3750: error = 0.2584592, gradient norm = 0.0001115 (50 iterations in 0.201s)
[t-SNE] Iteration 3800: error = 0.2579952, gradient norm = 0.0001146 (50 iterations in 0.202s)
[t-SNE] Iteration 3850: error = 0.2575344, gradient norm = 0.0001200 (50 iterations in 0.204s)
[t-SNE] Iteration 3900: error = 0.2570644, gradient norm = 0.0001234 (50 iterations in 0.204s)
[t-SNE] Iteration 3950: error = 0.2565946, gradient norm = 0.0001198 (50 iterations in 0.204s)
[t-SNE] Iteration 4000: error = 0.2561679, gradient norm = 0.0001123 (50 iterations in 0.210s)
[t-SNE] Iteration 4050: error = 0.2557034, gradient norm = 0.0001100 (50 iterations in 0.208s)
[t-SNE] Iteration 4100: error = 0.2552296, gradient norm = 0.0001084 (50 iterations in 0.210s)
[t-SNE] Iteration 4150: error = 0.2547871, gradient norm = 0.0001096 (50 iterations in 0.213s)
[t-SNE] Iteration 4200: error = 0.2542860, gradient norm = 0.0001086 (50 iterations in 0.201s)
[t-SNE] Iteration 4250: error = 0.2537715, gradient norm = 0.0001086 (50 iterations in 0.210s)
[t-SNE] Iteration 4300: error = 0.2531673, gradient norm = 0.0001155 (50 iterations in 0.198s)
[t-SNE] Iteration 4350: error = 0.2526192, gradient norm = 0.0001158 (50 iterations in 0.211s)
[t-SNE] Iteration 4400: error = 0.2520021, gradient norm = 0.0001148 (50 iterations in 0.197s)
[t-SNE] Iteration 4450: error = 0.2512997, gradient norm = 0.0001161 (50 iterations in 0.208s)
[t-SNE] Iteration 4500: error = 0.2505736, gradient norm = 0.0001247 (50 iterations in 0.213s)
[t-SNE] Iteration 4550: error = 0.2496718, gradient norm = 0.0001393 (50 iterations in 0.209s)
[t-SNE] Iteration 4600: error = 0.2488306, gradient norm = 0.0001377 (50 iterations in 0.225s)
[t-SNE] Iteration 4650: error = 0.2480509, gradient norm = 0.0001179 (50 iterations in 0.208s)
[t-SNE] Iteration 4700: error = 0.2476044, gradient norm = 0.0001183 (50 iterations in 0.244s)
[t-SNE] Iteration 4750: error = 0.2474410, gradient norm = 0.0001139 (50 iterations in 0.246s)
[t-SNE] Iteration 4800: error = 0.2471498, gradient norm = 0.0001064 (50 iterations in 0.241s)
[t-SNE] Iteration 4850: error = 0.2469043, gradient norm = 0.0000994 (50 iterations in 0.220s)
[t-SNE] Iteration 4900: error = 0.2467050, gradient norm = 0.0001018 (50 iterations in 0.216s)
[t-SNE] Iteration 4950: error = 0.2464422, gradient norm = 0.0001042 (50 iterations in 0.216s)
[t-SNE] Iteration 5000: error = 0.2462123, gradient norm = 0.0001145 (50 iterations in 0.211s)
[t-SNE] Iteration 5050: error = 0.2459694, gradient norm = 0.0001038 (50 iterations in 0.214s)
[t-SNE] Iteration 5100: error = 0.2457146, gradient norm = 0.0000983 (50 iterations in 0.203s)
[t-SNE] Iteration 5150: error = 0.2454324, gradient norm = 0.0000942 (50 iterations in 0.207s)
[t-SNE] Iteration 5200: error = 0.2451836, gradient norm = 0.0000991 (50 iterations in 0.213s)
[t-SNE] Iteration 5250: error = 0.2449707, gradient norm = 0.0000956 (50 iterations in 0.211s)
[t-SNE] Iteration 5300: error = 0.2447550, gradient norm = 0.0000969 (50 iterations in 0.212s)
[t-SNE] Iteration 5350: error = 0.2445012, gradient norm = 0.0000954 (50 iterations in 0.206s)
[t-SNE] Iteration 5400: error = 0.2442502, gradient norm = 0.0000961 (50 iterations in 0.214s)
[t-SNE] Iteration 5450: error = 0.2440316, gradient norm = 0.0000915 (50 iterations in 0.206s)
[t-SNE] Iteration 5500: error = 0.2437157, gradient norm = 0.0000920 (50 iterations in 0.212s)
[t-SNE] Iteration 5550: error = 0.2434583, gradient norm = 0.0000929 (50 iterations in 0.214s)
[t-SNE] Iteration 5600: error = 0.2432147, gradient norm = 0.0000902 (50 iterations in 0.207s)
[t-SNE] Iteration 5650: error = 0.2429342, gradient norm = 0.0000880 (50 iterations in 0.216s)
[t-SNE] Iteration 5700: error = 0.2425876, gradient norm = 0.0000988 (50 iterations in 0.213s)
[t-SNE] Iteration 5750: error = 0.2422452, gradient norm = 0.0000954 (50 iterations in 0.218s)
[t-SNE] Iteration 5800: error = 0.2419497, gradient norm = 0.0000920 (50 iterations in 0.204s)
[t-SNE] Iteration 5850: error = 0.2416679, gradient norm = 0.0000921 (50 iterations in 0.212s)
[t-SNE] Iteration 5900: error = 0.2413297, gradient norm = 0.0000962 (50 iterations in 0.216s)
[t-SNE] Iteration 5950: error = 0.2410631, gradient norm = 0.0000969 (50 iterations in 0.208s)
[t-SNE] Iteration 6000: error = 0.2407261, gradient norm = 0.0001019 (50 iterations in 0.213s)
[t-SNE] Iteration 6050: error = 0.2404214, gradient norm = 0.0000999 (50 iterations in 0.205s)
[t-SNE] Iteration 6100: error = 0.2401297, gradient norm = 0.0001015 (50 iterations in 0.209s)
[t-SNE] Iteration 6150: error = 0.2398682, gradient norm = 0.0000999 (50 iterations in 0.217s)
[t-SNE] Iteration 6200: error = 0.2395476, gradient norm = 0.0001091 (50 iterations in 0.212s)
[t-SNE] Iteration 6250: error = 0.2392312, gradient norm = 0.0001090 (50 iterations in 0.223s)
[t-SNE] Iteration 6300: error = 0.2387965, gradient norm = 0.0001056 (50 iterations in 0.207s)
[t-SNE] Iteration 6350: error = 0.2382746, gradient norm = 0.0001083 (50 iterations in 0.223s)
[t-SNE] Iteration 6400: error = 0.2377098, gradient norm = 0.0001119 (50 iterations in 0.208s)
[t-SNE] Iteration 6450: error = 0.2371763, gradient norm = 0.0001210 (50 iterations in 0.222s)
[t-SNE] Iteration 6500: error = 0.2365888, gradient norm = 0.0001220 (50 iterations in 0.215s)
[t-SNE] Iteration 6550: error = 0.2359422, gradient norm = 0.0001310 (50 iterations in 0.206s)
[t-SNE] Iteration 6600: error = 0.2350299, gradient norm = 0.0001450 (50 iterations in 0.212s)
[t-SNE] Iteration 6650: error = 0.2340196, gradient norm = 0.0001626 (50 iterations in 0.216s)
[t-SNE] Iteration 6700: error = 0.2328023, gradient norm = 0.0001738 (50 iterations in 0.217s)
[t-SNE] Iteration 6750: error = 0.2317577, gradient norm = 0.0001311 (50 iterations in 0.214s)
[t-SNE] Iteration 6800: error = 0.2311447, gradient norm = 0.0001190 (50 iterations in 0.216s)
[t-SNE] Iteration 6850: error = 0.2307127, gradient norm = 0.0001243 (50 iterations in 0.213s)
[t-SNE] Iteration 6900: error = 0.2304579, gradient norm = 0.0001133 (50 iterations in 0.215s)
[t-SNE] Iteration 6950: error = 0.2301681, gradient norm = 0.0001131 (50 iterations in 0.218s)
[t-SNE] Iteration 7000: error = 0.2298557, gradient norm = 0.0001135 (50 iterations in 0.213s)
[t-SNE] Iteration 7050: error = 0.2295498, gradient norm = 0.0001084 (50 iterations in 0.218s)
[t-SNE] Iteration 7100: error = 0.2293987, gradient norm = 0.0001008 (50 iterations in 0.220s)
[t-SNE] Iteration 7150: error = 0.2291743, gradient norm = 0.0000998 (50 iterations in 0.212s)
[t-SNE] Iteration 7200: error = 0.2290021, gradient norm = 0.0000911 (50 iterations in 0.221s)
[t-SNE] Iteration 7250: error = 0.2287240, gradient norm = 0.0000888 (50 iterations in 0.208s)
[t-SNE] Iteration 7300: error = 0.2285697, gradient norm = 0.0000844 (50 iterations in 0.218s)
[t-SNE] Iteration 7350: error = 0.2284353, gradient norm = 0.0000844 (50 iterations in 0.209s)
[t-SNE] Iteration 7400: error = 0.2282457, gradient norm = 0.0000895 (50 iterations in 0.217s)
[t-SNE] Iteration 7450: error = 0.2281063, gradient norm = 0.0000932 (50 iterations in 0.212s)
[t-SNE] Iteration 7500: error = 0.2279743, gradient norm = 0.0000894 (50 iterations in 0.216s)
[t-SNE] Iteration 7550: error = 0.2278584, gradient norm = 0.0000858 (50 iterations in 0.224s)
[t-SNE] Iteration 7600: error = 0.2277781, gradient norm = 0.0000873 (50 iterations in 0.216s)
[t-SNE] Iteration 7650: error = 0.2276019, gradient norm = 0.0000848 (50 iterations in 0.218s)
[t-SNE] Iteration 7700: error = 0.2274703, gradient norm = 0.0000844 (50 iterations in 0.207s)
[t-SNE] Iteration 7750: error = 0.2273785, gradient norm = 0.0000805 (50 iterations in 0.217s)
[t-SNE] Iteration 7800: error = 0.2271956, gradient norm = 0.0000816 (50 iterations in 0.211s)
[t-SNE] Iteration 7850: error = 0.2270120, gradient norm = 0.0000742 (50 iterations in 0.217s)
[t-SNE] Iteration 7900: error = 0.2268886, gradient norm = 0.0000755 (50 iterations in 0.214s)
[t-SNE] Iteration 7950: error = 0.2267495, gradient norm = 0.0000779 (50 iterations in 0.208s)
[t-SNE] Iteration 8000: error = 0.2266221, gradient norm = 0.0000776 (50 iterations in 0.219s)
[t-SNE] Iteration 8050: error = 0.2265399, gradient norm = 0.0000748 (50 iterations in 0.212s)
[t-SNE] Iteration 8100: error = 0.2264306, gradient norm = 0.0000736 (50 iterations in 0.213s)
[t-SNE] Iteration 8150: error = 0.2263470, gradient norm = 0.0000716 (50 iterations in 0.209s)
[t-SNE] Iteration 8200: error = 0.2261671, gradient norm = 0.0000693 (50 iterations in 0.220s)
[t-SNE] Iteration 8250: error = 0.2260303, gradient norm = 0.0000688 (50 iterations in 0.220s)
[t-SNE] Iteration 8300: error = 0.2259429, gradient norm = 0.0000706 (50 iterations in 0.216s)
[t-SNE] Iteration 8350: error = 0.2258578, gradient norm = 0.0000740 (50 iterations in 0.221s)
[t-SNE] Iteration 8400: error = 0.2257510, gradient norm = 0.0000753 (50 iterations in 0.233s)
[t-SNE] Iteration 8450: error = 0.2256668, gradient norm = 0.0000764 (50 iterations in 0.239s)
[t-SNE] Iteration 8500: error = 0.2255780, gradient norm = 0.0000736 (50 iterations in 0.225s)
[t-SNE] Iteration 8550: error = 0.2254093, gradient norm = 0.0000721 (50 iterations in 0.231s)
[t-SNE] Iteration 8600: error = 0.2252970, gradient norm = 0.0000718 (50 iterations in 0.222s)
[t-SNE] Iteration 8650: error = 0.2251626, gradient norm = 0.0000717 (50 iterations in 0.219s)
[t-SNE] Iteration 8700: error = 0.2250918, gradient norm = 0.0000737 (50 iterations in 0.213s)
[t-SNE] Iteration 8750: error = 0.2250383, gradient norm = 0.0000736 (50 iterations in 0.215s)
[t-SNE] Iteration 8800: error = 0.2249451, gradient norm = 0.0000755 (50 iterations in 0.210s)
[t-SNE] Iteration 8850: error = 0.2248710, gradient norm = 0.0000716 (50 iterations in 0.218s)
[t-SNE] Iteration 8900: error = 0.2247003, gradient norm = 0.0000715 (50 iterations in 0.212s)
[t-SNE] Iteration 8950: error = 0.2246633, gradient norm = 0.0000680 (50 iterations in 0.213s)
[t-SNE] Iteration 9000: error = 0.2245297, gradient norm = 0.0000677 (50 iterations in 0.210s)
[t-SNE] Iteration 9050: error = 0.2243963, gradient norm = 0.0000707 (50 iterations in 0.211s)
[t-SNE] Iteration 9100: error = 0.2243879, gradient norm = 0.0000703 (50 iterations in 0.211s)
[t-SNE] Iteration 9150: error = 0.2243113, gradient norm = 0.0000687 (50 iterations in 0.218s)
[t-SNE] Iteration 9200: error = 0.2242603, gradient norm = 0.0000717 (50 iterations in 0.221s)
[t-SNE] Iteration 9250: error = 0.2241800, gradient norm = 0.0000707 (50 iterations in 0.214s)
[t-SNE] Iteration 9300: error = 0.2239987, gradient norm = 0.0000711 (50 iterations in 0.205s)
[t-SNE] Iteration 9350: error = 0.2240290, gradient norm = 0.0000691 (50 iterations in 0.210s)
[t-SNE] Iteration 9400: error = 0.2238861, gradient norm = 0.0000701 (50 iterations in 0.222s)
[t-SNE] Iteration 9450: error = 0.2238231, gradient norm = 0.0000694 (50 iterations in 0.214s)
[t-SNE] Iteration 9500: error = 0.2237988, gradient norm = 0.0000660 (50 iterations in 0.222s)
[t-SNE] Iteration 9550: error = 0.2237818, gradient norm = 0.0000691 (50 iterations in 0.207s)
[t-SNE] Iteration 9600: error = 0.2237437, gradient norm = 0.0000672 (50 iterations in 0.211s)
[t-SNE] Iteration 9650: error = 0.2236365, gradient norm = 0.0000688 (50 iterations in 0.213s)
[t-SNE] Iteration 9700: error = 0.2235557, gradient norm = 0.0000689 (50 iterations in 0.216s)
[t-SNE] Iteration 9750: error = 0.2234234, gradient norm = 0.0000657 (50 iterations in 0.213s)
[t-SNE] Iteration 9800: error = 0.2233857, gradient norm = 0.0000628 (50 iterations in 0.208s)
[t-SNE] Iteration 9850: error = 0.2233446, gradient norm = 0.0000616 (50 iterations in 0.214s)
[t-SNE] Iteration 9900: error = 0.2233019, gradient norm = 0.0000647 (50 iterations in 0.211s)
[t-SNE] Iteration 9950: error = 0.2232545, gradient norm = 0.0000622 (50 iterations in 0.208s)
[t-SNE] Iteration 10000: error = 0.2231750, gradient norm = 0.0000621 (50 iterations in 0.205s)
[t-SNE] KL divergence after 10000 iterations: 0.223175

Looks like tSNE almost reconstructed the original data. Let us check how UMAP performs on this data set:

In [117]:
import warnings
warnings.filterwarnings("ignore")

from umap import UMAP
X_reduced = PCA(n_components = 2).fit_transform(X)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 998, 
             init = X_reduced, n_epochs = 10000, verbose = 2)
umap = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
plt.show()
UMAP(a=None, angular_rp_forest=False, b=None,
   init=array([[-1.91504, -0.08113],
       [-1.9149 ,  0.01343],
       ...,
       [ 2.07686, -0.13213],
       [ 2.08079, -0.15619]]),
   learning_rate=1, local_connectivity=1.0, metric='euclidean',
   metric_kwds=None, min_dist=1, n_components=2, n_epochs=10000,
   n_neighbors=998, negative_sample_rate=5, random_state=None,
   repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
   target_metric='categorical', target_metric_kwds=None,
   target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
   transform_seed=42, verbose=2)
Construct fuzzy simplicial set
Mon Jan 20 20:01:22 2020 Finding Nearest Neighbors
Mon Jan 20 20:01:22 2020 Finished Nearest Neighbor Search
Mon Jan 20 20:01:22 2020 Construct embedding
	completed  0  /  10000 epochs
	completed  1000  /  10000 epochs
	completed  2000  /  10000 epochs
	completed  3000  /  10000 epochs
	completed  4000  /  10000 epochs
	completed  5000  /  10000 epochs
	completed  6000  /  10000 epochs
	completed  7000  /  10000 epochs
	completed  8000  /  10000 epochs
	completed  9000  /  10000 epochs
Mon Jan 20 20:03:09 2020 Finished embedding

We conclude that both tSNE and UMAP can reconstruct the original data. To achieve this, tSNE needs a large learning rate and perplexity close to the size of the data set, while UMAP needs a small learning rate, min_dist = 1 and large (almost infinite) n_neighbors.

Contribution from Cost Function

Here we will plot derivative of tSNE cost function that is the KL-divergence, and UMAP cost function that is the Cross-Entropy (CE).

In [127]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

N_LOW_DIMS = 2
MAX_ITER = 1000
PERPLEXITY = 200
LEARNING_RATE = 0.6


X_train = X; n = X_train.shape[0]
y_train = X[:, 0]
dist = np.square(euclidean_distances(X_train, X_train))

plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
    #return -np.sum([p*np.log2(p) for p in prob if p!=0])

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 100 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))

plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
sns.distplot(sigma_array)
plt.title("Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

P = prob + np.transpose(prob)

def prob_low_dim(Y):
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    np.fill_diagonal(inv_distances, 0.)
    return inv_distances / np.sum(inv_distances, axis = 1, keepdims = True)

def KL(P, Y):
    Q = prob_low_dim(Y)
    return P * np.log(P + 0.01) - P * np.log(Q + 0.01)

def KL_gradient(P, Y):
    Q = prob_low_dim(Y)
    y_diff = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
    inv_dist = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    return 4*np.sum(np.expand_dims(P - Q, 2) * y_diff * np.expand_dims(inv_dist, 2), axis = 1)

np.random.seed(12345)
#y = np.random.normal(loc = 0, scale = 1, size = (n, N_LOW_DIMS))
y = X_reduced
KL_array = []; KL_gradient_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
    y = y - LEARNING_RATE * KL_gradient(P, y)
    KL_array.append(np.sum(KL(P, y)))
    KL_gradient_array.append(np.sum(KL_gradient(P, y)))
    if i % 100 == 0:
        print("KL divergence = " + str(np.sum(KL(P, y))))
        
plt.figure(figsize=(20,15))
plt.plot(KL_array,'-o')
plt.title("KL-divergence", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE", fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
plt.plot(KL_gradient_array,'-o')
plt.title("KL-divergence Gradient", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE GRADIENT", fontsize = 20)
plt.show()
Sigma binary search finished 100 of 1000 cells
Sigma binary search finished 200 of 1000 cells
Sigma binary search finished 300 of 1000 cells
Sigma binary search finished 400 of 1000 cells
Sigma binary search finished 500 of 1000 cells
Sigma binary search finished 600 of 1000 cells
Sigma binary search finished 700 of 1000 cells
Sigma binary search finished 800 of 1000 cells
Sigma binary search finished 900 of 1000 cells
Sigma binary search finished 1000 of 1000 cells

Mean sigma = 0.2828330993652344
Running Gradient Descent: 

KL divergence = 1114.6888750614978
KL divergence = 928.3283356025705
KL divergence = 924.7441511280074
KL divergence = 924.2842798588231
KL divergence = 924.0131153561099
KL divergence = 924.0128633161303
KL divergence = 924.0077617316904
KL divergence = 924.009719221748
KL divergence = 924.0147028515823
KL divergence = 923.9999497047991
In [158]:
plt.figure(figsize=(20,15))
sns.distplot(-np.sum(prob*np.log2(prob + 0.00001), axis=0))
#sns.distplot(np.power(2, -np.sum(prob*np.log2(prob+0.00001),axis=0)))
plt.show()
In [112]:
plt.figure(figsize=(20,15))
plt.scatter(y[:,0], y[:,1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title("tSNE on a syntetic data set", fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()

Let us figure out how sigma is connected with perplexity for tSNE:

In [1]:
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import euclidean_distances

path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
                1110020A21Rik  1110046J04Rik  1190002F15Rik  1500015A07Rik
SS2_15_0048_A3            0.0            0.0            0.0            0.0
SS2_15_0048_A6            0.0            0.0            0.0            0.0
SS2_15_0048_A5            0.0            0.0            0.0            0.0
SS2_15_0048_A4            0.0            0.0            0.0            0.0

This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)
In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances


my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(4,724,10):

    #X_train = X; n = X_train.shape[0]
    #y_train = X[:, 0]
    dist = np.square(euclidean_distances(X_train, X_train))

    def prob_high_dim(sigma, dist_row):
        exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
        exp_distance[dist_row] = 0
        prob_not_symmetr = exp_distance / np.sum(exp_distance)
        return prob_not_symmetr

    def perplexity(prob):
        return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

    def sigma_binary_search(perp_of_sigma, fixed_perplexity):
        sigma_lower_limit = 0; sigma_upper_limit = 1000
        for i in range(20):
            approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
            if perp_of_sigma(approx_sigma) < fixed_perplexity:
                sigma_lower_limit = approx_sigma
            else:
                sigma_upper_limit = approx_sigma
            if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
                break
        return approx_sigma

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERPLEXITY)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
    
    my_perp.append(PERPLEXITY)
    my_sigma_tSNE.append(np.mean(sigma_array))
Perplexity = 4, Mean Sigma = 4.2812251511898785
Perplexity = 14, Mean Sigma = 5.594089710512641
Perplexity = 24, Mean Sigma = 6.1362042773369305
Perplexity = 34, Mean Sigma = 6.5150474036872055
Perplexity = 44, Mean Sigma = 6.823305311149725
Perplexity = 54, Mean Sigma = 7.09219884606047
Perplexity = 64, Mean Sigma = 7.335841322744359
Perplexity = 74, Mean Sigma = 7.5612361204690774
Perplexity = 84, Mean Sigma = 7.772120683552832
Perplexity = 94, Mean Sigma = 7.97082325599713
Perplexity = 104, Mean Sigma = 8.159072705487299
Perplexity = 114, Mean Sigma = 8.33857658855076
Perplexity = 124, Mean Sigma = 8.510987851872791
Perplexity = 134, Mean Sigma = 8.677623791401613
Perplexity = 144, Mean Sigma = 8.839874960190757
Perplexity = 154, Mean Sigma = 8.998764293819832
Perplexity = 164, Mean Sigma = 9.155228151289444
Perplexity = 174, Mean Sigma = 9.310056377389577
Perplexity = 184, Mean Sigma = 9.464010846015462
Perplexity = 194, Mean Sigma = 9.617571058220037
Perplexity = 204, Mean Sigma = 9.771363029266869
Perplexity = 214, Mean Sigma = 9.925932857577362
Perplexity = 224, Mean Sigma = 10.081725413572855
Perplexity = 234, Mean Sigma = 10.239220198306292
Perplexity = 244, Mean Sigma = 10.39888339335692
Perplexity = 254, Mean Sigma = 10.561165196935558
Perplexity = 264, Mean Sigma = 10.726486504410898
Perplexity = 274, Mean Sigma = 10.895322820993774
Perplexity = 284, Mean Sigma = 11.068109693473943
Perplexity = 294, Mean Sigma = 11.24539455222018
Perplexity = 304, Mean Sigma = 11.427599624548545
Perplexity = 314, Mean Sigma = 11.615313631196262
Perplexity = 324, Mean Sigma = 11.80900408568995
Perplexity = 334, Mean Sigma = 12.009391571556389
Perplexity = 344, Mean Sigma = 12.216984892690649
Perplexity = 354, Mean Sigma = 12.432362114250992
Perplexity = 364, Mean Sigma = 12.656143923711511
Perplexity = 374, Mean Sigma = 12.889050904598982
Perplexity = 384, Mean Sigma = 13.131677105440108
Perplexity = 394, Mean Sigma = 13.384617906708957
Perplexity = 404, Mean Sigma = 13.64850465145857
Perplexity = 414, Mean Sigma = 13.923788869847133
Perplexity = 424, Mean Sigma = 14.210898116980186
Perplexity = 434, Mean Sigma = 14.510338532858055
Perplexity = 444, Mean Sigma = 14.822325892954565
Perplexity = 454, Mean Sigma = 15.147149229848852
Perplexity = 464, Mean Sigma = 15.485172165172726
Perplexity = 474, Mean Sigma = 15.836590495189474
Perplexity = 484, Mean Sigma = 16.2017944804783
Perplexity = 494, Mean Sigma = 16.581487389250174
Perplexity = 504, Mean Sigma = 16.976260606137068
Perplexity = 514, Mean Sigma = 17.387275589244993
Perplexity = 524, Mean Sigma = 17.816004140417004
Perplexity = 534, Mean Sigma = 18.264085886864688
Perplexity = 544, Mean Sigma = 18.733919665800126
Perplexity = 554, Mean Sigma = 19.228242629067193
Perplexity = 564, Mean Sigma = 19.750744270878798
Perplexity = 574, Mean Sigma = 20.30580669807988
Perplexity = 584, Mean Sigma = 20.89899611872668
Perplexity = 594, Mean Sigma = 21.537346546876364
Perplexity = 604, Mean Sigma = 22.230095037534916
Perplexity = 614, Mean Sigma = 22.98900002207836
Perplexity = 624, Mean Sigma = 23.830207366517136
Perplexity = 634, Mean Sigma = 24.77581807354975
Perplexity = 644, Mean Sigma = 25.85742726672295
Perplexity = 654, Mean Sigma = 27.122789255067623
Perplexity = 664, Mean Sigma = 28.647617254843258
Perplexity = 674, Mean Sigma = 30.56285234802928
Perplexity = 684, Mean Sigma = 33.12310826179036
Perplexity = 694, Mean Sigma = 36.92478440993325
Perplexity = 704, Mean Sigma = 43.965078598960154
Perplexity = 714, Mean Sigma = 81.6358787387443
In [3]:
plt.figure(figsize=(20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20)
plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
In [4]:
plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.show()
In [5]:
plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.show()
In [171]:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

N = 1000

perp = np.array([5,10,20,50,100,200,300,400,500,600,700,800,900,920,950,980,990,995,998])
sigma_exact = np.array([0.032,0.05,0.078,0.15,0.24,0.38,0.52,0.68,0.88,1.1,1.35,1.65,2.15,
                        2.31,2.68,3.5,4.26,5.26,7.5])

sigma = lambda perp, a, b, c: ((a*perp) / N) / (1 - c*((perp) / N)**b)
    
p , _ = optimize.curve_fit(sigma, perp, sigma_exact)
print(p)

plt.figure(figsize=(20,15))
plt.plot(perp, sigma_exact, "o")
plt.plot(perp, sigma(perp, p[0], p[1], p[2]), c = "red")
plt.title("Non-Linear Least Square Fit", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("X", fontsize = 20); plt.ylabel("Y", fontsize = 20)
plt.show()
[ 2.22944534 43.71777492  0.75860625]

Let us figure out how sigma is connected with perplexity for UMAP:

In [1]:
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import euclidean_distances

path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
                1110020A21Rik  1110046J04Rik  1190002F15Rik  1500015A07Rik
SS2_15_0048_A3            0.0            0.0            0.0            0.0
SS2_15_0048_A6            0.0            0.0            0.0            0.0
SS2_15_0048_A5            0.0            0.0            0.0            0.0
SS2_15_0048_A4            0.0            0.0            0.0            0.0

This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)
In [14]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances


my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(4,724,10):

    #X_train = X; n = X_train.shape[0]
    #y_train = X[:, 0]
    dist = np.square(euclidean_distances(X_train, X_train))
    rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]

    def prob_high_dim(sigma, dist_row):
        d = dist[dist_row] - rho[dist_row]
        d[d < 0] = 0
        return np.exp(- d / sigma)

    def k(prob):
        return np.power(2, np.sum(prob))

    def sigma_binary_search(k_of_sigma, fixed_k):
        sigma_lower_limit = 0; sigma_upper_limit = 1000
        for i in range(20):
            approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
            if k_of_sigma(approx_sigma) < fixed_k:
                sigma_lower_limit = approx_sigma
            else:
                sigma_upper_limit = approx_sigma
            if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
                break
        return approx_sigma

    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))
N_neighbor = 4, Mean Sigma = 3.3768448749734037
N_neighbor = 14, Mean Sigma = 61.94113752695435
N_neighbor = 24, Mean Sigma = 73.05938971109231
N_neighbor = 34, Mean Sigma = 79.15058349097907
N_neighbor = 44, Mean Sigma = 83.26585332774583
N_neighbor = 54, Mean Sigma = 86.33223192651845
N_neighbor = 64, Mean Sigma = 88.75830213450853
N_neighbor = 74, Mean Sigma = 90.75585823485305
N_neighbor = 84, Mean Sigma = 92.44780034326308
N_neighbor = 94, Mean Sigma = 93.91138140715701
N_neighbor = 104, Mean Sigma = 95.19835024572617
N_neighbor = 114, Mean Sigma = 96.34498244557301
N_neighbor = 124, Mean Sigma = 97.37747341560917
N_neighbor = 134, Mean Sigma = 98.31563587295277
N_neighbor = 144, Mean Sigma = 99.17439695177131
N_neighbor = 154, Mean Sigma = 99.96572419917783
N_neighbor = 164, Mean Sigma = 100.69880805202037
N_neighbor = 174, Mean Sigma = 101.38142308709342
N_neighbor = 184, Mean Sigma = 102.01968827061147
N_neighbor = 194, Mean Sigma = 102.61881418068316
N_neighbor = 204, Mean Sigma = 103.1830883558902
N_neighbor = 214, Mean Sigma = 103.71614168476125
N_neighbor = 224, Mean Sigma = 104.2211801646142
N_neighbor = 234, Mean Sigma = 104.70083039566126
N_neighbor = 244, Mean Sigma = 105.15742328579866
N_neighbor = 254, Mean Sigma = 105.59293810881717
N_neighbor = 264, Mean Sigma = 106.00926622998115
N_neighbor = 274, Mean Sigma = 106.40786080387052
N_neighbor = 284, Mean Sigma = 106.79015766974933
N_neighbor = 294, Mean Sigma = 107.15732361351311
N_neighbor = 304, Mean Sigma = 107.51049745016257
N_neighbor = 314, Mean Sigma = 107.85069811943524
N_neighbor = 324, Mean Sigma = 108.17878872322636
N_neighbor = 334, Mean Sigma = 108.49549384090487
N_neighbor = 344, Mean Sigma = 108.80167257852395
N_neighbor = 354, Mean Sigma = 109.09785105529444
N_neighbor = 364, Mean Sigma = 109.3846912490589
N_neighbor = 374, Mean Sigma = 109.66277255692296
N_neighbor = 384, Mean Sigma = 109.93252652983426
N_neighbor = 394, Mean Sigma = 110.19447529116157
N_neighbor = 404, Mean Sigma = 110.44894383606298
N_neighbor = 414, Mean Sigma = 110.69650623385466
N_neighbor = 424, Mean Sigma = 110.937336969642
N_neighbor = 434, Mean Sigma = 111.17187558605684
N_neighbor = 444, Mean Sigma = 111.4004124476257
N_neighbor = 454, Mean Sigma = 111.62315933398028
N_neighbor = 464, Mean Sigma = 111.84056511138405
N_neighbor = 474, Mean Sigma = 112.05266174657385
N_neighbor = 484, Mean Sigma = 112.25981685702361
N_neighbor = 494, Mean Sigma = 112.46229816415456
N_neighbor = 504, Mean Sigma = 112.66013896665093
N_neighbor = 514, Mean Sigma = 112.85363895267082
N_neighbor = 524, Mean Sigma = 113.0430019101617
N_neighbor = 534, Mean Sigma = 113.22831175180787
N_neighbor = 544, Mean Sigma = 113.40981488787263
N_neighbor = 554, Mean Sigma = 113.58761121440867
N_neighbor = 564, Mean Sigma = 113.76185390536345
N_neighbor = 574, Mean Sigma = 113.93263619705286
N_neighbor = 584, Mean Sigma = 114.10019517610858
N_neighbor = 594, Mean Sigma = 114.26451219526749
N_neighbor = 604, Mean Sigma = 114.42583766063498
N_neighbor = 614, Mean Sigma = 114.58416890831633
N_neighbor = 624, Mean Sigma = 114.73968575120638
N_neighbor = 634, Mean Sigma = 114.89240550462094
N_neighbor = 644, Mean Sigma = 115.04246269524431
N_neighbor = 654, Mean Sigma = 115.18999850949761
N_neighbor = 664, Mean Sigma = 115.33497432090716
N_neighbor = 674, Mean Sigma = 115.4775965813152
N_neighbor = 684, Mean Sigma = 115.61787461435328
N_neighbor = 694, Mean Sigma = 115.7558190756004
N_neighbor = 704, Mean Sigma = 115.89167504337247
N_neighbor = 714, Mean Sigma = 116.02541321482738
In [7]:
plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20)
plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()

Let us compare how Perplexity and N_neighbors hyperparameters behave for tSNE and UMAP, respectively, for the same data set where the Euclidean distances are fixed:

In [8]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')

plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
In [21]:
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
In [25]:
plt.figure(figsize=(20, 15))

plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')

plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.xlim(0,500); plt.ylim(0,600)
plt.show()
In [31]:
plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.xlim(-0.1,0.5)
plt.show()

tSNE Degrades to PCA at Large Perplexity

Here we will try to show for different non-linear manifolds (Swiss Roll, S-shape and Sphere) that tSNE degrades down to PCA (providing that PCA was used for initialization) at large perplexity values. There is a common belief that at large perplexities tSNE recovers the initial data set, i.e. is capable of preserving global structure of the data. We will start with a 2D World Map collection of points and embedd it into 3D non-linear manifold later.

In [2]:
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader

shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
                                        category='cultural', name=shapename)

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('NorthAmerica.png')
plt.close()
        
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela', 
                                           'Columbia', 'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('SouthAmerica.png')
plt.close()

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Australia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia', 
                                          'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden', 
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway', 
                                           'Italy', 'Iran', 'Pakistan', 'Afganistan', 'Iraq', 'Bulgaria', 
                                           'Romania', 'Turkmenistan', 'Uzbekistan' 'Austria', 'Ireland', 
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Eurasia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Marocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia', 
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic', 
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia', 
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa', 
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 
                                           'Nigeria', 'Cameroon']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Africa.png')
plt.close()


rng = np.random.RandomState(123)
plt.figure(figsize = (20,15))

N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)

N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)

N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)

N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)

N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)

plt.title('Original World Map Data Set', fontsize = 25)
plt.xlabel('Dimension 1', fontsize = 22); plt.ylabel('Dimension 2', fontsize = 22)

X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)

plt.show()
(3023, 2)
(3023,)

Now let us embedd the 2D World Map collection of points into 3D Swiss Roll non-linear manifold:

In [3]:
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)

X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
Out[3]:
(3023, 3)
In [16]:
from mpl_toolkits import mplot3d
plt.figure(figsize=(20,15))
ax = plt.axes(projection = '3d')
ax.scatter3D(X_swiss_roll[:, 0], X_swiss_roll[:, 1], X_swiss_roll[:, 2], c = y)
plt.show()

Next we will run PCA on the 3D Swiss Roll with embedded 2D World Map data points and compare the result with tSNE at very large perplexity = 2000, which is of the same order of magnitude as the sample size, i.e. 3023 data points.

In [121]:
from sklearn.decomposition import PCA
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(X_swiss_roll_reduced[:, 0], X_swiss_roll_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 25)
plt.xlabel("PCA1", fontsize = 22); plt.ylabel("PCA2", fontsize = 22)
plt.show()
In [196]:
from sklearn.manifold import TSNE
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 2000, 
             init = X_swiss_roll_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 25); plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 3022 nearest neighbors...
[t-SNE] Indexed 3023 samples in 0.000s...
[t-SNE] Computed neighbors for 3023 samples in 1.606s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.615106
[t-SNE] Computed conditional probabilities in 4.002s
[t-SNE] Iteration 50: error = 34.5878296, gradient norm = 0.0000004 (50 iterations in 4.020s)
[t-SNE] Iteration 100: error = 34.5919189, gradient norm = 0.0000000 (50 iterations in 3.833s)
[t-SNE] Iteration 100: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 100 iterations with early exaggeration: 34.591919
[t-SNE] Iteration 150: error = 0.1602514, gradient norm = 0.0028735 (50 iterations in 3.974s)
[t-SNE] Iteration 200: error = 0.0249942, gradient norm = 0.0002043 (50 iterations in 4.112s)
[t-SNE] Iteration 250: error = 0.0248579, gradient norm = 0.0000229 (50 iterations in 4.942s)
[t-SNE] Iteration 300: error = 0.0248000, gradient norm = 0.0000251 (50 iterations in 4.002s)
[t-SNE] Iteration 350: error = 0.0248179, gradient norm = 0.0000202 (50 iterations in 4.832s)
[t-SNE] Iteration 400: error = 0.0247878, gradient norm = 0.0000306 (50 iterations in 4.238s)
[t-SNE] Iteration 450: error = 0.0246588, gradient norm = 0.0000330 (50 iterations in 4.470s)
[t-SNE] Iteration 500: error = 0.0248645, gradient norm = 0.0000336 (50 iterations in 4.465s)
[t-SNE] Iteration 550: error = 0.0249281, gradient norm = 0.0000234 (50 iterations in 4.294s)
[t-SNE] Iteration 600: error = 0.0250215, gradient norm = 0.0000167 (50 iterations in 4.725s)
[t-SNE] Iteration 650: error = 0.0250395, gradient norm = 0.0000122 (50 iterations in 4.248s)
[t-SNE] Iteration 700: error = 0.0249933, gradient norm = 0.0000115 (50 iterations in 5.551s)
[t-SNE] Iteration 750: error = 0.0249767, gradient norm = 0.0000124 (50 iterations in 4.229s)
[t-SNE] Iteration 800: error = 0.0248114, gradient norm = 0.0000238 (50 iterations in 4.516s)
[t-SNE] Iteration 800: did not make any progress during the last 300 episodes. Finished.
[t-SNE] KL divergence after 800 iterations: 0.024811

We see that PCA and tSNE at large perplexity (with init = PCA) results look very similar. This is is the first indication that there is something wrong going on with tSNE at large perplexities. In contrast to tSNE, UMAP at n_neighbor = 2000 tries to reconstruct the original 2D World Map.

In [199]:
import warnings
warnings.filterwarnings("ignore")

from umap import UMAP
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 2000, 
             init = X_swiss_roll_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 25); plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(a=None, angular_rp_forest=False, b=None,
     init=array([[ 0.12535979,  0.40356518],
       [ 0.00599581,  0.49934623],
       [ 0.12201155,  0.40671414],
       ...,
       [-0.43454162, -0.28447539],
       [-0.55506668,  0.15485573],
       [-0.53099092, -0.11675039]]),
     learning_rate=1, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=1, n_components=2, n_epochs=1000,
     n_neighbors=2000, negative_sample_rate=5, random_state=None,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=2)
Construct fuzzy simplicial set
Fri Mar  6 09:32:24 2020 Finding Nearest Neighbors
Fri Mar  6 09:32:24 2020 Finished Nearest Neighbor Search
Fri Mar  6 09:32:29 2020 Construct embedding
	completed  0  /  1000 epochs
	completed  100  /  1000 epochs
	completed  200  /  1000 epochs
	completed  300  /  1000 epochs
	completed  400  /  1000 epochs
	completed  500  /  1000 epochs
	completed  600  /  1000 epochs
	completed  700  /  1000 epochs
	completed  800  /  1000 epochs
	completed  900  /  1000 epochs
Fri Mar  6 09:33:07 2020 Finished embedding

Now let us embed the 2D World Map collection of points into 3D S-shaped non-linear manifold:

In [59]:
#t = 3 * np.pi * (generator.rand(1, n_samples) - 0.5)
#x = np.sin(t)
#y = 2.0 * generator.rand(1, n_samples)
#z = np.sign(t) * (np.cos(t) - 1)
In [173]:
t = (X[:, 0] - 0.6) * 4 * np.pi
x_S = np.sin(t)
y_S = 2 * X[:, 1]
z_S = np.sign(t) * (np.cos(t) - 1)

X_S = np.array([x_S, y_S, z_S]).T
X_S.shape
Out[173]:
(3023, 3)
In [194]:
from mpl_toolkits import mplot3d
plt.figure(figsize=(20,15))
ax = plt.axes(projection = '3d')
ax.scatter3D(X_S[:, 0], X_S[:, 1], X_S[:, 2], c = y)
#ax.view_init(30, 120)
plt.show()

Again, we will run PCA followed by tSNE and UMAP on the S-shaped 3D embedding of the original 2D World Map collection of data points, and demonstrate that at perplexity = 2000 tSNE output surprisingly resembles the PCA output, while UMAP with the similar hyperparameter n_neighbor = 2000 creates a more meaningful reconstruction of the original 2D World Map.

In [120]:
from sklearn.decomposition import PCA
X_S_reduced = PCA(n_components = 2).fit_transform(X_S)
plt.figure(figsize=(20,15))
plt.scatter(X_S_reduced[:, 0], X_S_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 25)
plt.xlabel("PCA1", fontsize = 22); plt.ylabel("PCA2", fontsize = 22)
plt.show()
In [211]:
from sklearn.manifold import TSNE
X_S_reduced = PCA(n_components = 2).fit_transform(X_S)
model = TSNE(learning_rate = 10000, n_components = 2, random_state = 123, perplexity = 2000, 
             init = X_S_reduced, n_iter = 1000, verbose = 2, early_exaggeration=124)
tsne = model.fit_transform(X_S)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 25); plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 3022 nearest neighbors...
[t-SNE] Indexed 3023 samples in 0.000s...
[t-SNE] Computed neighbors for 3023 samples in 0.953s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 1.448983
[t-SNE] Computed conditional probabilities in 3.886s
[t-SNE] Iteration 50: error = 1118.7269287, gradient norm = 0.1310913 (50 iterations in 3.954s)
[t-SNE] Iteration 100: error = 1052.6090088, gradient norm = 0.1829850 (50 iterations in 5.769s)
[t-SNE] Iteration 150: error = 983.7592773, gradient norm = 0.1777281 (50 iterations in 6.034s)
[t-SNE] Iteration 200: error = 849.8014526, gradient norm = 0.2291984 (50 iterations in 5.704s)
[t-SNE] Iteration 250: error = 885.8141479, gradient norm = 0.3025164 (50 iterations in 5.832s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 885.814148
[t-SNE] Iteration 300: error = 1.4438870, gradient norm = 0.0010251 (50 iterations in 4.890s)
[t-SNE] Iteration 350: error = 0.0720827, gradient norm = 0.0035229 (50 iterations in 4.979s)
[t-SNE] Iteration 400: error = 0.0173820, gradient norm = 0.0008265 (50 iterations in 5.877s)
[t-SNE] Iteration 450: error = 0.0167094, gradient norm = 0.0005168 (50 iterations in 3.969s)
[t-SNE] Iteration 500: error = 0.0165868, gradient norm = 0.0004544 (50 iterations in 4.299s)
[t-SNE] Iteration 550: error = 0.0159346, gradient norm = 0.0003132 (50 iterations in 5.752s)
[t-SNE] Iteration 600: error = 0.0170120, gradient norm = 0.0005140 (50 iterations in 4.298s)
[t-SNE] Iteration 650: error = 0.0157751, gradient norm = 0.0001777 (50 iterations in 4.853s)
[t-SNE] Iteration 700: error = 0.0158459, gradient norm = 0.0001374 (50 iterations in 4.132s)
[t-SNE] Iteration 750: error = 0.0165553, gradient norm = 0.0003649 (50 iterations in 6.505s)
[t-SNE] Iteration 800: error = 0.0156519, gradient norm = 0.0001634 (50 iterations in 9.178s)
[t-SNE] Iteration 850: error = 0.0157948, gradient norm = 0.0001167 (50 iterations in 7.105s)
[t-SNE] Iteration 900: error = 0.0159746, gradient norm = 0.0002014 (50 iterations in 4.817s)
[t-SNE] Iteration 950: error = 0.0163133, gradient norm = 0.0005700 (50 iterations in 4.745s)
[t-SNE] Iteration 1000: error = 0.0157455, gradient norm = 0.0001063 (50 iterations in 5.648s)
[t-SNE] KL divergence after 1000 iterations: 0.015745
In [200]:
import warnings
warnings.filterwarnings("ignore")

from umap import UMAP
X_S_reduced = PCA(n_components = 2).fit_transform(X_S)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 2000, 
             init = X_S_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X_S)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 25); plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(a=None, angular_rp_forest=False, b=None,
     init=array([[-1.47279946, -0.75488543],
       [-1.80675933, -0.28930955],
       [-1.46639981, -0.73331449],
       ...,
       [ 0.09624343,  0.3243066 ],
       [-0.61412673,  1.01673621],
       [-0.08959831,  0.68531188]]),
     learning_rate=1, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=1, n_components=2, n_epochs=1000,
     n_neighbors=2000, negative_sample_rate=5, random_state=None,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=2)
Construct fuzzy simplicial set
Fri Mar  6 09:35:00 2020 Finding Nearest Neighbors
Fri Mar  6 09:35:00 2020 Finished Nearest Neighbor Search
Fri Mar  6 09:35:04 2020 Construct embedding
	completed  0  /  1000 epochs
	completed  100  /  1000 epochs
	completed  200  /  1000 epochs
	completed  300  /  1000 epochs
	completed  400  /  1000 epochs
	completed  500  /  1000 epochs
	completed  600  /  1000 epochs
	completed  700  /  1000 epochs
	completed  800  /  1000 epochs
	completed  900  /  1000 epochs
Fri Mar  6 09:35:41 2020 Finished embedding

Finally, let us embedd the 2D World Map collection of points into 3D Sphere non-linear manifold, this will sort of represent the globe with the continents mapped to it. Again we will run PCA, tSNE (perplexity = 2000) and UMAP (n_neighbor = 2000) and compare the outputs:

In [153]:
p = X[:, 0] * (3 * np.pi - 0.6)
t = X[:, 1] * np.pi

x_sphere = np.sin(t) * np.cos(p)
y_sphere = np.sin(t) * np.sin(p)
z_sphere = np.cos(t)

X_sphere = np.array([x_sphere, y_sphere, z_sphere]).T
X_sphere.shape
Out[153]:
(3023, 3)
In [171]:
from mpl_toolkits import mplot3d
plt.figure(figsize=(20,15))
ax = plt.axes(projection = '3d')
ax.view_init(10, 60)
ax.scatter3D(X_sphere[:, 0], X_sphere[:, 1], -X_sphere[:, 2], c = y)
plt.show()
In [172]:
from sklearn.decomposition import PCA
X_sphere_reduced = PCA(n_components = 2).fit_transform(X_sphere)
plt.figure(figsize=(20,15))
plt.scatter(X_sphere_reduced[:, 0], X_sphere_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 25)
plt.xlabel("PCA1", fontsize = 22); plt.ylabel("PCA2", fontsize = 22)
plt.show()
In [198]:
from sklearn.manifold import TSNE
X_sphere_reduced = PCA(n_components = 2).fit_transform(X_sphere)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 2000, 
             init = X_sphere_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X_sphere)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 25); plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 3022 nearest neighbors...
[t-SNE] Indexed 3023 samples in 0.000s...
[t-SNE] Computed neighbors for 3023 samples in 1.185s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 1.077166
[t-SNE] Computed conditional probabilities in 3.774s
[t-SNE] Iteration 50: error = 34.5676880, gradient norm = 0.0000007 (50 iterations in 4.207s)
[t-SNE] Iteration 100: error = 34.5713806, gradient norm = 0.0000000 (50 iterations in 3.844s)
[t-SNE] Iteration 100: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 100 iterations with early exaggeration: 34.571381
[t-SNE] Iteration 150: error = 0.1153407, gradient norm = 0.0036319 (50 iterations in 4.265s)
[t-SNE] Iteration 200: error = 0.0286314, gradient norm = 0.0001743 (50 iterations in 4.767s)
[t-SNE] Iteration 250: error = 0.0275777, gradient norm = 0.0001394 (50 iterations in 4.427s)
[t-SNE] Iteration 300: error = 0.0274423, gradient norm = 0.0000265 (50 iterations in 4.319s)
[t-SNE] Iteration 350: error = 0.0274453, gradient norm = 0.0000255 (50 iterations in 4.698s)
[t-SNE] Iteration 400: error = 0.0274686, gradient norm = 0.0000264 (50 iterations in 4.148s)
[t-SNE] Iteration 450: error = 0.0274986, gradient norm = 0.0000290 (50 iterations in 4.957s)
[t-SNE] Iteration 500: error = 0.0275053, gradient norm = 0.0000275 (50 iterations in 4.091s)
[t-SNE] Iteration 550: error = 0.0274970, gradient norm = 0.0000280 (50 iterations in 4.818s)
[t-SNE] Iteration 600: error = 0.0274998, gradient norm = 0.0000284 (50 iterations in 4.270s)
[t-SNE] Iteration 650: error = 0.0274858, gradient norm = 0.0000321 (50 iterations in 4.551s)
[t-SNE] Iteration 650: did not make any progress during the last 300 episodes. Finished.
[t-SNE] KL divergence after 650 iterations: 0.027486
In [201]:
import warnings
warnings.filterwarnings("ignore")

from umap import UMAP
X_sphere_reduced = PCA(n_components = 2).fit_transform(X_sphere)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 2000, 
             init = X_sphere_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X_sphere)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 25); plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(a=None, angular_rp_forest=False, b=None,
     init=array([[ 1.05712819,  0.2271468 ],
       [ 1.11068641, -0.18788974],
       [ 1.14823761,  0.18744746],
       ...,
       [-0.65540008, -0.5460625 ],
       [-0.05684139, -0.90522455],
       [-0.4559346 , -0.73633387]]),
     learning_rate=1, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=1, n_components=2, n_epochs=1000,
     n_neighbors=2000, negative_sample_rate=5, random_state=None,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=2)
Construct fuzzy simplicial set
Fri Mar  6 09:36:46 2020 Finding Nearest Neighbors
Fri Mar  6 09:36:47 2020 Finished Nearest Neighbor Search
Fri Mar  6 09:36:51 2020 Construct embedding
	completed  0  /  1000 epochs
	completed  100  /  1000 epochs
	completed  200  /  1000 epochs
	completed  300  /  1000 epochs
	completed  400  /  1000 epochs
	completed  500  /  1000 epochs
	completed  600  /  1000 epochs
	completed  700  /  1000 epochs
	completed  800  /  1000 epochs
	completed  900  /  1000 epochs
Fri Mar  6 09:37:28 2020 Finished embedding

For the Sphere non-linear manifold we conclude the same thig as for the Swiss Roll and S-shape, i.e. PCA nad tSNE at large perplexity outputs look very similar while UMAP at large n_neighbor tries to reconstruct the original 2D World Map.

Since tSNE implements a gradient descent for optimization of the KL-divergence, it seems to be the KL-gradient that disappears at large perplexities and therefore one ends up with the PCA if PCA was used for initialization of the gradient descent. Here we will take a closer look at the gradient of tSNE in order to prove that it goes to zero at large perplexities and thus tSNE degrades to PCA, provided that PCA has been used for tSNE initialization.

The scikitlearn implementation of tSNE does not provide a comprehensive information about high- and low-dimensional probabilities and the KL-gradient without hacking the codes, but one can still import a few functions from the tSNE Python module that allow to see the distributions of the high- and low-dimensional probabilities, as well as KL-divergence values and distributions of KL-gradient values.

Here we plot the histogram of high-dimensional probability values for the Swiss Roll data set at very large perplexity = 2000:

In [113]:
import seaborn as sns
from sklearn.manifold._t_sne import _joint_probabilities, _kl_divergence
from sklearn.metrics.pairwise import euclidean_distances
dist = np.square(euclidean_distances(X_swiss_roll, X_swiss_roll))
P = _joint_probabilities(distances = dist, desired_perplexity = 2000, verbose = 2)
print(P.shape)

plt.figure(figsize = (20,15))
sns.distplot(P)
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.615106
(4567753,)

There are two interesting things here to pay attention to. First, the histogram does not really look uniform as one might expect. In theory, the Gaussian kernel should transform to a uniform kernel at large perplexity / sigma, but we do not see it here. Second, the probability values are still very small, ~10^(-7) which is very strange because one would expect them to approach 1, since the Gaussian kernel goes to 1 when sigma goes to infinity.

Now let us plot the histogram of the low-dimensional probabilities:

In [112]:
from scipy.spatial.distance import pdist
degrees_of_freedom = 1
MACHINE_EPSILON = np.finfo(np.double).eps
X_embedded = X_reduced.reshape(X_swiss_roll.shape[0], 2)
dist = pdist(X_embedded, "sqeuclidean")
dist /= degrees_of_freedom
dist += 1.
dist **= (degrees_of_freedom + 1.0) / -2.0
Q = np.maximum(dist / (2.0 * np.sum(dist)), MACHINE_EPSILON)
print(Q.shape)
plt.figure(figsize = (20,15))
sns.distplot(Q)
plt.title("tSNE: LOW-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
(4567753,)
In [163]:
#P = P * 12
#2.0 * np.dot(P, np.log(np.maximum(P, MACHINE_EPSILON) / Q))

And finally, let us print the KL-divergence value, the KL-gradient vector and Frobenius norm of the KL-gradient vector, as well as plot the histogram of the KL-gradient values:

In [162]:
from sklearn.decomposition import PCA
from sklearn.manifold._t_sne import _kl_divergence
X_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
kl, grad = _kl_divergence(params = X_reduced, P = P, 
                          degrees_of_freedom = 1, n_samples = X_swiss_roll.shape[0], n_components = 2)
print(kl)
print(grad)
print(grad.shape)
print(np.linalg.norm(grad))

plt.figure(figsize = (20,15))
sns.distplot(grad)
plt.title("tSNE: GRADIENT OF KL-DIVERGENCE", fontsize = 20)
plt.xlabel("GRADIENT", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
0.19772935156019597
[-3.35862231e-05 -1.43664010e-04 -1.37686982e-05 ... -3.04112302e-06
  1.38647559e-04  4.44109973e-05]
(6046,)
0.007856695290534468

Finally, let us gradually increase the perplexity values and check how Frobenius norm of the KL-gradient behaves as a function of perplexity. Frobenius norm is just a Euclidean norm of a vector / matrix, we are using it for simplicity for capturing information about the KL-gradient vector in one value.

In [17]:
from sklearn.decomposition import PCA
from sklearn.manifold._t_sne import _joint_probabilities, _kl_divergence
from sklearn.metrics.pairwise import euclidean_distances

X_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
dist = np.square(euclidean_distances(X_swiss_roll, X_swiss_roll))

my_grad_norm = []
for PERP in range(3,3000,50):
    P = _joint_probabilities(distances = dist, desired_perplexity = PERP, verbose = 0)
    #P = 12 * P
    kl, grad = _kl_divergence(params = X_reduced, P = P, 
                              degrees_of_freedom = 1, n_samples = X_swiss_roll.shape[0], n_components = 2)
    grad_norm = np.linalg.norm(grad)
    #grad_norm = np.mean(grad)
    my_grad_norm.append(grad_norm)
    print('Perplexity = {0}, KL-divergence = {1}, grad_norm = {2}'.format(PERP, kl, grad_norm))

plt.figure(figsize = (20,15))
plt.plot(range(3,3000,50), my_grad_norm, '-o')
plt.title("tSNE: Gradient vs. perplexity", fontsize = 20)
plt.xlabel("Perplexity", fontsize = 20)
plt.ylabel("Gradient", fontsize = 20)
plt.show()
Perplexity = 3, KL-divergence = 6.425246929740883, grad_norm = 0.017910456730326773
Perplexity = 53, KL-divergence = 3.6183483271934085, grad_norm = 0.017852346247683164
Perplexity = 103, KL-divergence = 2.952651491603908, grad_norm = 0.0177726417703892
Perplexity = 153, KL-divergence = 2.556617956939391, grad_norm = 0.01768317832496556
Perplexity = 203, KL-divergence = 2.2752020950398997, grad_norm = 0.017583055472569407
Perplexity = 253, KL-divergence = 2.0576415779852653, grad_norm = 0.01747092400186825
Perplexity = 303, KL-divergence = 1.8801636512971531, grad_norm = 0.017345346149643132
Perplexity = 353, KL-divergence = 1.7301104669288927, grad_norm = 0.017204741403936224
Perplexity = 403, KL-divergence = 1.6005626681475063, grad_norm = 0.0170489164229274
Perplexity = 453, KL-divergence = 1.4870289131570757, grad_norm = 0.016878886498689728
Perplexity = 503, KL-divergence = 1.3862953522393138, grad_norm = 0.01669598816448813
Perplexity = 553, KL-divergence = 1.296061000168976, grad_norm = 0.01650158416435802
Perplexity = 603, KL-divergence = 1.2146555822169063, grad_norm = 0.016297124452911534
Perplexity = 653, KL-divergence = 1.1407653565756184, grad_norm = 0.016083900656374765
Perplexity = 703, KL-divergence = 1.0732977683796863, grad_norm = 0.01586268744639043
Perplexity = 753, KL-divergence = 1.0113186659386078, grad_norm = 0.015633505172397075
Perplexity = 803, KL-divergence = 0.9540150435357431, grad_norm = 0.015395517570353734
Perplexity = 853, KL-divergence = 0.9006687532653352, grad_norm = 0.015147041905333884
Perplexity = 903, KL-divergence = 0.8506494306566317, grad_norm = 0.014885916378671432
Perplexity = 953, KL-divergence = 0.8034517764075318, grad_norm = 0.01461105015077518
Perplexity = 1003, KL-divergence = 0.7587658948781871, grad_norm = 0.014324018671559705
Perplexity = 1053, KL-divergence = 0.716406904383116, grad_norm = 0.014027570059840453
Perplexity = 1103, KL-divergence = 0.6762168557128664, grad_norm = 0.013724066724334618
Perplexity = 1153, KL-divergence = 0.6380430075592185, grad_norm = 0.01341524951555859
Perplexity = 1203, KL-divergence = 0.6017432396203883, grad_norm = 0.013102396551808672
Perplexity = 1253, KL-divergence = 0.5671826257863458, grad_norm = 0.01278640667279637
Perplexity = 1303, KL-divergence = 0.5342389889147205, grad_norm = 0.012467932132436292
Perplexity = 1353, KL-divergence = 0.5028023544559572, grad_norm = 0.012147444109640866
Perplexity = 1403, KL-divergence = 0.4727712762762647, grad_norm = 0.011825247564062315
Perplexity = 1453, KL-divergence = 0.44405527266962286, grad_norm = 0.011501545703809499
Perplexity = 1503, KL-divergence = 0.4165734244167749, grad_norm = 0.011176461531232711
Perplexity = 1553, KL-divergence = 0.39025223508599144, grad_norm = 0.010850039369902437
Perplexity = 1603, KL-divergence = 0.36502596957931205, grad_norm = 0.010522269587716017
Perplexity = 1653, KL-divergence = 0.34083536211208953, grad_norm = 0.010193091988906318
Perplexity = 1703, KL-divergence = 0.3176276398688937, grad_norm = 0.009862409642224857
Perplexity = 1753, KL-divergence = 0.2953547391276683, grad_norm = 0.009530077021727748
Perplexity = 1803, KL-divergence = 0.2739742889752798, grad_norm = 0.009195922603188696
Perplexity = 1853, KL-divergence = 0.25344812494195396, grad_norm = 0.008859732040332064
Perplexity = 1903, KL-divergence = 0.23374287748078876, grad_norm = 0.008521265433029244
Perplexity = 1953, KL-divergence = 0.21482850175993995, grad_norm = 0.008180233565275708
Perplexity = 2003, KL-divergence = 0.1966781447337792, grad_norm = 0.007836292957276446
Perplexity = 2053, KL-divergence = 0.17926978514739406, grad_norm = 0.0074890744728293365
Perplexity = 2103, KL-divergence = 0.1625845530809017, grad_norm = 0.00713815051862531
Perplexity = 2153, KL-divergence = 0.14660692416515844, grad_norm = 0.0067830257499968925
Perplexity = 2203, KL-divergence = 0.13132445026609058, grad_norm = 0.0064231185387067825
Perplexity = 2253, KL-divergence = 0.11673028248436075, grad_norm = 0.006057801444651017
Perplexity = 2303, KL-divergence = 0.10282053603738542, grad_norm = 0.00568631656747074
Perplexity = 2353, KL-divergence = 0.08959602624589436, grad_norm = 0.005307778345982938
Perplexity = 2403, KL-divergence = 0.07706358184244033, grad_norm = 0.004921164925703301
Perplexity = 2453, KL-divergence = 0.06523700287893658, grad_norm = 0.004525278573995795
Perplexity = 2503, KL-divergence = 0.054137654626230824, grad_norm = 0.004118672252360917
Perplexity = 2553, KL-divergence = 0.04379834023449195, grad_norm = 0.003699648442496384
Perplexity = 2603, KL-divergence = 0.03426572821783321, grad_norm = 0.0032661703896046715
Perplexity = 2653, KL-divergence = 0.025606928400574613, grad_norm = 0.0028158939213104727
Perplexity = 2703, KL-divergence = 0.017918414733508765, grad_norm = 0.0023463536479253625
Perplexity = 2753, KL-divergence = 0.011343522265985312, grad_norm = 0.0018561068600688906
Perplexity = 2803, KL-divergence = 0.006102757429988575, grad_norm = 0.0013499580382743443
Perplexity = 2853, KL-divergence = 0.002561824486026828, grad_norm = 0.0008705926688346034
Perplexity = 2903, KL-divergence = 0.0013990590985975598, grad_norm = 0.0006881141985406209
Perplexity = 2953, KL-divergence = 0.004161949656519749, grad_norm = 0.0012127576284191462

We conclude that the KL-gradient decreases with the increase of perplexity reaching almost zero values when perplexity approaches the sample size of the Swiss Roll data set. Now we are going to reproduce as much as possible the output of the scikitlearn implementation of tSNE. This means that we have to write functions for high- and low-dimensional probabilities, KL-divergence, KL-gradient and gradient descent, so we need to make sure that we mimic the scikitlearn output, i.e. the probability histograms as well as KL and KL-gradient values look identical. First, we run the tSNE (scikitlearn implementation) on the Swiss Roll again using method = 'exact' and early_exaggeration = 1, this output we will try to mimic later.

In [16]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 2000, 
             init = X_reduced, n_iter = 1000, verbose = 2, method = 'exact', early_exaggeration = 1)
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 20); plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.615106
[t-SNE] Iteration 1: error = 0.1977460, gradient norm = 0.0078570 (1 iterations in 0.379s)
[t-SNE] Iteration 2: error = 0.1880074, gradient norm = 0.0076339 (1 iterations in 0.371s)
[t-SNE] Iteration 3: error = 0.1719964, gradient norm = 0.0072181 (1 iterations in 0.382s)
[t-SNE] Iteration 4: error = 0.1525274, gradient norm = 0.0066357 (1 iterations in 0.372s)
[t-SNE] Iteration 5: error = 0.1320209, gradient norm = 0.0059367 (1 iterations in 0.375s)
[t-SNE] Iteration 6: error = 0.1123988, gradient norm = 0.0051887 (1 iterations in 0.371s)
[t-SNE] Iteration 7: error = 0.0949130, gradient norm = 0.0044568 (1 iterations in 0.375s)
[t-SNE] Iteration 8: error = 0.0801260, gradient norm = 0.0037878 (1 iterations in 0.370s)
[t-SNE] Iteration 9: error = 0.0680667, gradient norm = 0.0032044 (1 iterations in 0.368s)
[t-SNE] Iteration 10: error = 0.0584547, gradient norm = 0.0027096 (1 iterations in 0.371s)
[t-SNE] Iteration 11: error = 0.0508936, gradient norm = 0.0022943 (1 iterations in 0.371s)
[t-SNE] Iteration 12: error = 0.0449909, gradient norm = 0.0019450 (1 iterations in 0.366s)
[t-SNE] Iteration 13: error = 0.0404080, gradient norm = 0.0016494 (1 iterations in 0.368s)
[t-SNE] Iteration 14: error = 0.0368695, gradient norm = 0.0013978 (1 iterations in 0.367s)
[t-SNE] Iteration 15: error = 0.0341534, gradient norm = 0.0011834 (1 iterations in 0.367s)
[t-SNE] Iteration 16: error = 0.0320806, gradient norm = 0.0010012 (1 iterations in 0.365s)
[t-SNE] Iteration 17: error = 0.0305069, gradient norm = 0.0008463 (1 iterations in 0.365s)
[t-SNE] Iteration 18: error = 0.0293179, gradient norm = 0.0007146 (1 iterations in 0.367s)
[t-SNE] Iteration 19: error = 0.0284239, gradient norm = 0.0006024 (1 iterations in 0.367s)
[t-SNE] Iteration 20: error = 0.0277551, gradient norm = 0.0005069 (1 iterations in 0.364s)
[t-SNE] Iteration 21: error = 0.0272579, gradient norm = 0.0004255 (1 iterations in 0.366s)
[t-SNE] Iteration 22: error = 0.0268907, gradient norm = 0.0003563 (1 iterations in 0.367s)
[t-SNE] Iteration 23: error = 0.0266213, gradient norm = 0.0002975 (1 iterations in 0.373s)
[t-SNE] Iteration 24: error = 0.0264251, gradient norm = 0.0002477 (1 iterations in 0.367s)
[t-SNE] Iteration 25: error = 0.0262832, gradient norm = 0.0002056 (1 iterations in 0.378s)
[t-SNE] Iteration 26: error = 0.0261814, gradient norm = 0.0001702 (1 iterations in 0.366s)
[t-SNE] Iteration 27: error = 0.0261089, gradient norm = 0.0001405 (1 iterations in 0.365s)
[t-SNE] Iteration 28: error = 0.0260576, gradient norm = 0.0001157 (1 iterations in 0.373s)
[t-SNE] Iteration 29: error = 0.0260215, gradient norm = 0.0000952 (1 iterations in 0.367s)
[t-SNE] Iteration 30: error = 0.0259962, gradient norm = 0.0000782 (1 iterations in 0.370s)
[t-SNE] Iteration 31: error = 0.0259786, gradient norm = 0.0000643 (1 iterations in 0.368s)
[t-SNE] Iteration 32: error = 0.0259663, gradient norm = 0.0000530 (1 iterations in 0.370s)
[t-SNE] Iteration 33: error = 0.0259577, gradient norm = 0.0000438 (1 iterations in 0.367s)
[t-SNE] Iteration 34: error = 0.0259518, gradient norm = 0.0000365 (1 iterations in 0.368s)
[t-SNE] Iteration 35: error = 0.0259476, gradient norm = 0.0000306 (1 iterations in 0.369s)
[t-SNE] Iteration 36: error = 0.0259446, gradient norm = 0.0000260 (1 iterations in 0.372s)
[t-SNE] Iteration 37: error = 0.0259425, gradient norm = 0.0000225 (1 iterations in 0.370s)
[t-SNE] Iteration 38: error = 0.0259409, gradient norm = 0.0000197 (1 iterations in 0.370s)
[t-SNE] Iteration 39: error = 0.0259398, gradient norm = 0.0000176 (1 iterations in 0.371s)
[t-SNE] Iteration 40: error = 0.0259389, gradient norm = 0.0000159 (1 iterations in 0.372s)
[t-SNE] Iteration 41: error = 0.0259382, gradient norm = 0.0000146 (1 iterations in 0.374s)
[t-SNE] Iteration 42: error = 0.0259376, gradient norm = 0.0000136 (1 iterations in 0.374s)
[t-SNE] Iteration 43: error = 0.0259370, gradient norm = 0.0000127 (1 iterations in 0.365s)
[t-SNE] Iteration 44: error = 0.0259366, gradient norm = 0.0000119 (1 iterations in 0.369s)
[t-SNE] Iteration 45: error = 0.0259362, gradient norm = 0.0000112 (1 iterations in 0.367s)
[t-SNE] Iteration 46: error = 0.0259359, gradient norm = 0.0000105 (1 iterations in 0.368s)
[t-SNE] Iteration 47: error = 0.0259355, gradient norm = 0.0000099 (1 iterations in 0.374s)
[t-SNE] Iteration 48: error = 0.0259353, gradient norm = 0.0000093 (1 iterations in 0.369s)
[t-SNE] Iteration 49: error = 0.0259350, gradient norm = 0.0000088 (1 iterations in 0.368s)
[t-SNE] Iteration 50: error = 0.0259348, gradient norm = 0.0000083 (1 iterations in 0.372s)
[t-SNE] Iteration 51: error = 0.0259346, gradient norm = 0.0000078 (1 iterations in 0.368s)
[t-SNE] Iteration 52: error = 0.0259344, gradient norm = 0.0000073 (1 iterations in 0.368s)
[t-SNE] Iteration 53: error = 0.0259342, gradient norm = 0.0000069 (1 iterations in 0.366s)
[t-SNE] Iteration 54: error = 0.0259340, gradient norm = 0.0000064 (1 iterations in 0.368s)
[t-SNE] Iteration 55: error = 0.0259339, gradient norm = 0.0000060 (1 iterations in 0.367s)
[t-SNE] Iteration 56: error = 0.0259338, gradient norm = 0.0000056 (1 iterations in 0.365s)
[t-SNE] Iteration 57: error = 0.0259337, gradient norm = 0.0000052 (1 iterations in 0.367s)
[t-SNE] Iteration 58: error = 0.0259336, gradient norm = 0.0000048 (1 iterations in 0.372s)
[t-SNE] Iteration 59: error = 0.0259335, gradient norm = 0.0000045 (1 iterations in 0.367s)
[t-SNE] Iteration 60: error = 0.0259334, gradient norm = 0.0000041 (1 iterations in 0.370s)
[t-SNE] Iteration 61: error = 0.0259333, gradient norm = 0.0000038 (1 iterations in 0.368s)
[t-SNE] Iteration 62: error = 0.0259333, gradient norm = 0.0000035 (1 iterations in 0.369s)
[t-SNE] Iteration 63: error = 0.0259332, gradient norm = 0.0000032 (1 iterations in 0.367s)
[t-SNE] Iteration 64: error = 0.0259332, gradient norm = 0.0000029 (1 iterations in 0.366s)
[t-SNE] Iteration 65: error = 0.0259332, gradient norm = 0.0000026 (1 iterations in 0.367s)
[t-SNE] Iteration 66: error = 0.0259331, gradient norm = 0.0000024 (1 iterations in 0.364s)
[t-SNE] Iteration 67: error = 0.0259331, gradient norm = 0.0000021 (1 iterations in 0.367s)
[t-SNE] Iteration 68: error = 0.0259331, gradient norm = 0.0000019 (1 iterations in 0.369s)
[t-SNE] Iteration 69: error = 0.0259331, gradient norm = 0.0000017 (1 iterations in 0.377s)
[t-SNE] Iteration 70: error = 0.0259331, gradient norm = 0.0000016 (1 iterations in 0.367s)
[t-SNE] Iteration 71: error = 0.0259330, gradient norm = 0.0000014 (1 iterations in 0.365s)
[t-SNE] Iteration 72: error = 0.0259330, gradient norm = 0.0000013 (1 iterations in 0.370s)
[t-SNE] Iteration 73: error = 0.0259330, gradient norm = 0.0000011 (1 iterations in 0.367s)
[t-SNE] Iteration 74: error = 0.0259330, gradient norm = 0.0000010 (1 iterations in 0.367s)
[t-SNE] Iteration 75: error = 0.0259330, gradient norm = 0.0000009 (1 iterations in 0.366s)
[t-SNE] Iteration 76: error = 0.0259330, gradient norm = 0.0000008 (1 iterations in 0.378s)
[t-SNE] Iteration 77: error = 0.0259330, gradient norm = 0.0000007 (1 iterations in 0.372s)
[t-SNE] Iteration 78: error = 0.0259330, gradient norm = 0.0000007 (1 iterations in 0.367s)
[t-SNE] Iteration 79: error = 0.0259330, gradient norm = 0.0000006 (1 iterations in 0.367s)
[t-SNE] Iteration 80: error = 0.0259330, gradient norm = 0.0000006 (1 iterations in 0.368s)
[t-SNE] Iteration 81: error = 0.0259330, gradient norm = 0.0000005 (1 iterations in 0.371s)
[t-SNE] Iteration 82: error = 0.0259330, gradient norm = 0.0000004 (1 iterations in 0.365s)
[t-SNE] Iteration 83: error = 0.0259330, gradient norm = 0.0000004 (1 iterations in 0.369s)
[t-SNE] Iteration 84: error = 0.0259330, gradient norm = 0.0000004 (1 iterations in 0.366s)
[t-SNE] Iteration 85: error = 0.0259330, gradient norm = 0.0000003 (1 iterations in 0.368s)
[t-SNE] Iteration 86: error = 0.0259330, gradient norm = 0.0000003 (1 iterations in 0.369s)
[t-SNE] Iteration 87: error = 0.0259330, gradient norm = 0.0000003 (1 iterations in 0.369s)
[t-SNE] Iteration 88: error = 0.0259330, gradient norm = 0.0000002 (1 iterations in 0.368s)
[t-SNE] Iteration 89: error = 0.0259330, gradient norm = 0.0000002 (1 iterations in 0.368s)
[t-SNE] Iteration 90: error = 0.0259330, gradient norm = 0.0000002 (1 iterations in 0.367s)
[t-SNE] Iteration 91: error = 0.0259330, gradient norm = 0.0000002 (1 iterations in 0.368s)
[t-SNE] Iteration 92: error = 0.0259330, gradient norm = 0.0000002 (1 iterations in 0.368s)
[t-SNE] Iteration 93: error = 0.0259330, gradient norm = 0.0000001 (1 iterations in 0.368s)
[t-SNE] Iteration 94: error = 0.0259330, gradient norm = 0.0000001 (1 iterations in 0.367s)
[t-SNE] Iteration 95: error = 0.0259330, gradient norm = 0.0000001 (1 iterations in 0.369s)
[t-SNE] Iteration 96: error = 0.0259330, gradient norm = 0.0000001 (1 iterations in 0.367s)
[t-SNE] Iteration 96: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 96 iterations with early exaggeration: 0.025933
[t-SNE] Iteration 97: error = 0.0259330, gradient norm = 0.0000001 (1 iterations in 0.367s)
[t-SNE] Iteration 97: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 97 iterations: 0.025933

Let us first write a function delivering the high-dimensional probabilities to observe points at certain distances:

In [114]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

N_LOW_DIMS = 2
MAX_ITER = 200
PERPLEXITY = 2000
LEARNING_RATE = 0.1

X_train = X_swiss_roll; n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
print('\n')

dist = np.square(euclidean_distances(X_train, X_train))
X_reduced = PCA(n_components = 2).fit_transform(X_train)

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 1000 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))

P = (prob + np.transpose(prob)) / (2*n)
P[np.tril_indices(P.shape[0])] = np.nan
P = P[~np.isnan(P)]
print(P.shape)

plt.figure(figsize = (20,15))
sns.distplot(P)
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 3) (3023,)


Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.44061425423882394
(4567753,)

Comparing this histogram with the output of scikitlearn tSNE above we conclude that the distributions look very similar. Very good, let us now mimic the distribution of the low-dimensional probabilities and KL-divergence:

In [128]:
def prob_low_dim(Y):
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    inv_distances[np.tril_indices(Y.shape[0])] = np.nan
    inv_distances = inv_distances[~np.isnan(inv_distances)]
    return inv_distances / (2 * np.sum(inv_distances))

def KL(P, Y):
    Q = prob_low_dim(Y)
    return 2 * np.dot(P, np.log(P / Q))
In [116]:
plt.figure(figsize = (20,15))
sns.distplot(prob_low_dim(X_reduced))
plt.title("tSNE: LOW-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
In [118]:
prob_low_dim(X_reduced).shape
Out[118]:
(4567753,)
In [119]:
P.shape
Out[119]:
(4567753,)
In [117]:
2.0 * np.dot(P, np.log(P / prob_low_dim(X_reduced)))
Out[117]:
0.19772935156019608
In [129]:
KL(P, X_reduced)
Out[129]:
0.19772935156019608

Again, we conclude that this histogram looks almost identical to the histogram above from the scikiltlearn tSNE output. Finally, we reproduce below the KL-divergence vector of scikitlearn tSNE.

In [164]:
from scipy.spatial.distance import squareform
def KL_gradient(P, Y):
    Q = prob_low_dim(Y)
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    inv_distances[np.tril_indices(Y.shape[0])] = np.nan
    inv_distances = inv_distances[~np.isnan(inv_distances)]
    PQd = squareform((P - Q) * inv_distances)
    
    grad = np.ndarray((n, N_LOW_DIMS), dtype = Y.dtype)
    for i in range(n):
        grad[i] = np.dot(np.ravel(PQd[i], order = 'K'), Y[i] - Y)
    
    return 4 * grad.ravel()
In [165]:
KL_gradient(P, X_reduced)
Out[165]:
array([-3.35862231e-05, -1.43664010e-04, -1.37686982e-05, ...,
       -3.04112302e-06,  1.38647559e-04,  4.44109973e-05])

Now we have the building bricks of the tSNE algorithm, let us wrap it up as a single pieace of code delivering all statistics we want to obtain from tSNE.

In [185]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import euclidean_distances

N_LOW_DIMS = 2
MOMENTUM = 0.8
MAX_ITER = 100
PERPLEXITY = 2000
LEARNING_RATE = 200

X_train = X_swiss_roll; n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
print('\n')

dist = np.square(euclidean_distances(X_train, X_train))
plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

X_reduced = PCA(n_components = 2).fit_transform(X_train)

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 1000 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))

plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("HISTOGRAM OF SIGMA VALUES", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

P = (prob + np.transpose(prob)) / (2*n)
P[np.tril_indices(P.shape[0])] = np.nan
P = P[~np.isnan(P)]

plt.figure(figsize = (20,15))
sns.distplot(P)
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def prob_low_dim(Y):
    Y = Y.reshape(n, N_LOW_DIMS)
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    inv_distances[np.tril_indices(Y.shape[0])] = np.nan
    inv_distances = inv_distances[~np.isnan(inv_distances)]
    return inv_distances / (2 * np.sum(inv_distances))

def KL(P, Y):
    Y = Y.reshape(n, N_LOW_DIMS)
    Q = prob_low_dim(Y)
    return 2 * np.dot(P, np.log(P / Q))

plt.figure(figsize = (20,15))
sns.distplot(prob_low_dim(X_reduced))
plt.title("tSNE: LOW-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

def KL_gradient(P, Y):
    Y = Y.reshape(n, N_LOW_DIMS)
    Q = prob_low_dim(Y)
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    inv_distances[np.tril_indices(Y.shape[0])] = np.nan
    inv_distances = inv_distances[~np.isnan(inv_distances)]
    PQd = squareform((P - Q) * inv_distances)
    grad = np.ndarray((n, N_LOW_DIMS), dtype = Y.dtype)
    for i in range(n):
        grad[i] = np.dot(np.ravel(PQd[i], order = 'K'), Y[i] - Y)
    return 4 * grad.ravel()

plt.figure(figsize = (20,15))
sns.distplot(KL_gradient(P, X_reduced))
plt.title("tSNE: GRADIENT OF KL-DIVERGENCE", fontsize = 20)
plt.xlabel("GRADIENT", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()

np.random.seed(12345)
y = X_reduced.copy().ravel()
update = np.zeros_like(y)
KL_array = []; KL_gradient_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
    KL_array.append(KL(P, y))
    KL_gradient_array.append(np.linalg.norm(KL_gradient(P, y)))
    #if i % 100 == 0:
    print("Iter = " + str(i) + ", KL divergence = " + str(KL(P, y)) + ", KL-gradient = " + 
          str(np.linalg.norm(KL_gradient(P, y))))
    update = MOMENTUM * update - LEARNING_RATE * KL_gradient(P, y)
    y = y + update
        
plt.figure(figsize = (20,15))
plt.plot(KL_array,'-o')
plt.title("KL-divergence", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
plt.plot(KL_gradient_array,'-o')
plt.title("KL-divergence Gradient", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE GRADIENT", fontsize = 20)
plt.show()

plt.figure(figsize = (20,15))
plt.scatter(y.reshape(n, N_LOW_DIMS)[:,0], y.reshape(n, N_LOW_DIMS)[:,1], 
            c = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa)), 
            cmap = 'tab10', s = 50)
plt.title("tSNE Programmed from Scratch: 2D World Map Embedded into 3D Swiss Roll", fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 3) (3023,)


Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.44061425423882394
Running Gradient Descent: 

Iter = 0, KL divergence = 0.19772935156019605, KL-gradient = 0.007856695290534468
Iter = 1, KL divergence = 0.18560179589709513, KL-gradient = 0.0075752202878324485
Iter = 2, KL divergence = 0.16535091532656065, KL-gradient = 0.007028271976442297
Iter = 3, KL divergence = 0.1412752730307692, KL-gradient = 0.006261730272795559
Iter = 4, KL divergence = 0.11716397003628778, KL-gradient = 0.005374782612794075
Iter = 5, KL divergence = 0.09560242272549282, KL-gradient = 0.004481885984126101
Iter = 6, KL divergence = 0.0777973950189654, KL-gradient = 0.003670488444747745
Iter = 7, KL divergence = 0.06388595401674256, KL-gradient = 0.0029858584832976923
Iter = 8, KL divergence = 0.05340527811716026, KL-gradient = 0.0024369270930010858
Iter = 9, KL divergence = 0.04567779468179057, KL-gradient = 0.0020088452903850177
Iter = 10, KL divergence = 0.040039803713954825, KL-gradient = 0.001675161793313108
Iter = 11, KL divergence = 0.03594067530592024, KL-gradient = 0.0014077868500515726
Iter = 12, KL divergence = 0.03296317026114383, KL-gradient = 0.0011837534035642838
Iter = 13, KL divergence = 0.030806160696246903, KL-gradient = 0.0009880648100812845
Iter = 14, KL divergence = 0.02925563455907694, KL-gradient = 0.0008132575107086549
Iter = 15, KL divergence = 0.028157400103616587, KL-gradient = 0.0006573162317186844
Iter = 16, KL divergence = 0.027396668922771213, KL-gradient = 0.0005214124355933729
Iter = 17, KL divergence = 0.026884979087739024, KL-gradient = 0.0004081761555874255
Iter = 18, KL divergence = 0.026552772816101862, KL-gradient = 0.00032045605185200267
Iter = 19, KL divergence = 0.026345456992260333, KL-gradient = 0.0002598723227287698
Iter = 20, KL divergence = 0.026221152900633777, KL-gradient = 0.0002245079221185825
Iter = 21, KL divergence = 0.026148996707661173, KL-gradient = 0.00020748436342135065
Iter = 22, KL divergence = 0.026107448992433856, KL-gradient = 0.00019958310419848028
Iter = 23, KL divergence = 0.02608248124366406, KL-gradient = 0.00019353419771515412
Iter = 24, KL divergence = 0.026065724718138214, KL-gradient = 0.00018554934906102184
Iter = 25, KL divergence = 0.026052739327509025, KL-gradient = 0.00017455269523980715
Iter = 26, KL divergence = 0.026041543530277725, KL-gradient = 0.00016104389141472953
Iter = 27, KL divergence = 0.026031487780527298, KL-gradient = 0.00014627678054760933
Iter = 28, KL divergence = 0.02602248728956464, KL-gradient = 0.0001317306361668007
Iter = 29, KL divergence = 0.026014574867782678, KL-gradient = 0.00011873772564928549
Iter = 30, KL divergence = 0.02600770075977267, KL-gradient = 0.00010818619394967931
Iter = 31, KL divergence = 0.026001694868715806, KL-gradient = 0.00010031888711507228
Iter = 32, KL divergence = 0.025996313459523974, KL-gradient = 9.473032144380565e-05
Iter = 33, KL divergence = 0.025991310557318428, KL-gradient = 9.060518907059343e-05
Iter = 34, KL divergence = 0.025986496692604902, KL-gradient = 8.706611422980004e-05
Iter = 35, KL divergence = 0.025981768563504273, KL-gradient = 8.34365184726139e-05
Iter = 36, KL divergence = 0.025977108934056175, KL-gradient = 7.934082318444595e-05
Iter = 37, KL divergence = 0.025972565301340293, KL-gradient = 7.468320267962217e-05
Iter = 38, KL divergence = 0.025968219038463154, KL-gradient = 6.957234162149551e-05
Iter = 39, KL divergence = 0.025964155491760222, KL-gradient = 6.423580007141732e-05
Iter = 40, KL divergence = 0.025960441908217717, KL-gradient = 5.894273119452577e-05
Iter = 41, KL divergence = 0.025957115916948326, KL-gradient = 5.394102004204619e-05
Iter = 42, KL divergence = 0.025954183857852822, KL-gradient = 4.941145370452214e-05
Iter = 43, KL divergence = 0.025951626158204556, KL-gradient = 4.54422312666106e-05
Iter = 44, KL divergence = 0.02594940625792029, KL-gradient = 4.2027204188040746e-05
Iter = 45, KL divergence = 0.025947479961352923, KL-gradient = 3.908709530534399e-05
Iter = 46, KL divergence = 0.02594580307725542, KL-gradient = 3.65053533444039e-05
Iter = 47, KL divergence = 0.025944336346885977, KL-gradient = 3.416509934126916e-05
Iter = 48, KL divergence = 0.025943047625653852, KL-gradient = 3.197556140067712e-05
Iter = 49, KL divergence = 0.025941911908396016, KL-gradient = 2.9883499947876218e-05
Iter = 50, KL divergence = 0.025940910048104733, KL-gradient = 2.7871722275508755e-05
Iter = 51, KL divergence = 0.025940026984785456, KL-gradient = 2.594970732871086e-05
Iter = 52, KL divergence = 0.02593925009140978, KL-gradient = 2.4141219818044807e-05
Iter = 53, KL divergence = 0.025938567973056405, KL-gradient = 2.2472457435022343e-05
Iter = 54, KL divergence = 0.02593796981051582, KL-gradient = 2.0962949661549265e-05
Iter = 55, KL divergence = 0.025937445169087088, KL-gradient = 1.9620387533955286e-05
Iter = 56, KL divergence = 0.025936984109370464, KL-gradient = 1.843963644001091e-05
Iter = 57, KL divergence = 0.025936577426024173, KL-gradient = 1.740524555974403e-05
Iter = 58, KL divergence = 0.025936216877228378, KL-gradient = 1.6495989574023955e-05
Iter = 59, KL divergence = 0.02593589532411753, KL-gradient = 1.56896921057898e-05
Iter = 60, KL divergence = 0.02593560675442831, KL-gradient = 1.4966922562718059e-05
Iter = 61, KL divergence = 0.025935346204704376, KL-gradient = 1.4312900312386872e-05
Iter = 62, KL divergence = 0.025935109616783983, KL-gradient = 1.3717680952906029e-05
Iter = 63, KL divergence = 0.02593489366846124, KL-gradient = 1.3175160582682532e-05
Iter = 64, KL divergence = 0.025934695611158846, KL-gradient = 1.2681566812509337e-05
Iter = 65, KL divergence = 0.025934513134883028, KL-gradient = 1.223400512993205e-05
Iter = 66, KL divergence = 0.025934344268002393, KL-gradient = 1.1829418206729275e-05
Iter = 67, KL divergence = 0.02593418730971495, KL-gradient = 1.1464087135366047e-05
Iter = 68, KL divergence = 0.025934040787484045, KL-gradient = 1.1133617121310569e-05
Iter = 69, KL divergence = 0.025933903430153532, KL-gradient = 1.0833238686968483e-05
Iter = 70, KL divergence = 0.025933774148617562, KL-gradient = 1.0558226197422597e-05
Iter = 71, KL divergence = 0.025933652018465966, KL-gradient = 1.0304271591926064e-05
Iter = 72, KL divergence = 0.02593353626187043, KL-gradient = 1.006772075503701e-05
Iter = 73, KL divergence = 0.025933426228161957, KL-gradient = 9.845650739074503e-06
Iter = 74, KL divergence = 0.02593332137391916, KL-gradient = 9.635816828692204e-06
Iter = 75, KL divergence = 0.025933221243905014, KL-gradient = 9.436522263379068e-06
Iter = 76, KL divergence = 0.025933125454024236, KL-gradient = 9.24646413541727e-06
Iter = 77, KL divergence = 0.02593303367701989, KL-gradient = 9.064595401973942e-06
Iter = 78, KL divergence = 0.025932945631111863, KL-gradient = 8.890024231255433e-06
Iter = 79, KL divergence = 0.025932861071303114, KL-gradient = 8.721954888735594e-06
Iter = 80, KL divergence = 0.025932779782864842, KL-gradient = 8.559662792376281e-06
Iter = 81, KL divergence = 0.02593270157642888, KL-gradient = 8.402491132532457e-06
Iter = 82, KL divergence = 0.025932626284167475, KL-gradient = 8.249856490683449e-06
Iter = 83, KL divergence = 0.02593255375666624, KL-gradient = 8.101254162886368e-06
Iter = 84, KL divergence = 0.025932483860268828, KL-gradient = 7.95625828531793e-06
Iter = 85, KL divergence = 0.02593241647475525, KL-gradient = 7.814515769384206e-06
Iter = 86, KL divergence = 0.025932351491290438, KL-gradient = 7.675735626820858e-06
Iter = 87, KL divergence = 0.02593228881067748, KL-gradient = 7.53967633433038e-06
Iter = 88, KL divergence = 0.025932228341863726, KL-gradient = 7.406133754413636e-06
Iter = 89, KL divergence = 0.025932170000721094, KL-gradient = 7.274931301843545e-06
Iter = 90, KL divergence = 0.02593211370905121, KL-gradient = 7.145913023458804e-06
Iter = 91, KL divergence = 0.025932059393794324, KL-gradient = 7.01893940256957e-06
Iter = 92, KL divergence = 0.02593200698638671, KL-gradient = 6.8938851861470996e-06
Iter = 93, KL divergence = 0.025931956422239433, KL-gradient = 6.770638379224488e-06
Iter = 94, KL divergence = 0.025931907640305004, KL-gradient = 6.649099668059938e-06
Iter = 95, KL divergence = 0.02593186058269241, KL-gradient = 6.529181791381911e-06
Iter = 96, KL divergence = 0.025931815194351303, KL-gradient = 6.410808656583956e-06
Iter = 97, KL divergence = 0.025931771422774773, KL-gradient = 6.293914213666187e-06
Iter = 98, KL divergence = 0.025931729217747324, KL-gradient = 6.178441218788603e-06
Iter = 99, KL divergence = 0.025931688531122793, KL-gradient = 6.064340046382786e-06
In [183]:
KL_gradient_array_perp100 = KL_gradient_array
In [181]:
KL_gradient_array_perp500 = KL_gradient_array
In [186]:
KL_gradient_array_perp2000 = KL_gradient_array
In [187]:
plt.figure(figsize = (20, 15))

plt.plot(range(MAX_ITER), KL_gradient_array_perp100, '-o')
plt.plot(range(MAX_ITER), KL_gradient_array_perp500, '-o')
plt.plot(range(MAX_ITER), KL_gradient_array_perp2000, '-o')

plt.gca().legend(('Perplexity = 100', 'Perplexity = 500', 'Perplexity = 2000'), fontsize = 20)
plt.title("tSNE: KL-Gradient at Different Perplexities", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-GRADIENT", fontsize = 20)
plt.show()

We can clearly see that the KL-gradient becomes close to zero at large perplexities. However, we have seen that the high-dimensional probabilities are not at all close to 1, so $P\approx 1$ at $\sigma\rightarrow\infty$ is not the reason for disappearence of the KL-gradinet at large perplexities. From the equation of the KL-gradient, only the $P-Q$ difference depends on perplexitiy, indeed the rest y-dpendent factors are not directly sensitive to the increasing of perplexity. Let us prove here that with the increase of perplexity P becomes close to Q.

In [190]:
np.linalg.norm(P)
Out[190]:
0.00031713715663696587
In [191]:
np.linalg.norm(Q)
Out[191]:
0.00024341306471132288
In [203]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import euclidean_distances

N_LOW_DIMS = 2
MOMENTUM = 0.8
MAX_ITER = 100
PERPLEXITY = 2000
LEARNING_RATE = 200

X_train = X_swiss_roll; n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
print('\n')

X_reduced = PCA(n_components = 2).fit_transform(X_train)

def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr

def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))

def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma

def prob_low_dim(Y):
    Y = Y.reshape(n, N_LOW_DIMS)
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    inv_distances[np.tril_indices(Y.shape[0])] = np.nan
    inv_distances = inv_distances[~np.isnan(inv_distances)]
    return inv_distances / (2 * np.sum(inv_distances))

Q = prob_low_dim(X_reduced)
P2Q = []; Pmax = []; Pmedian = []
for PERP in [20, 50, 100, 300, 500, 800, 1000, 1500, 2000, 2500]:
    print('Working with Perplexity = {}'.format(PERP))
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERP)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
        if (dist_row + 1) % 1000 == 0:
            print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
    print("\nMean sigma = " + str(np.mean(sigma_array)))

    P = (prob + np.transpose(prob)) / (2 * n)
    P[np.tril_indices(P.shape[0])] = np.nan
    P = P[~np.isnan(P)]
    
    Pmax.append(np.max(P)); Pmedian.append(np.median(P))

    print('Perplexity = {0}, P / Q = {1}, P_max = {2}, P_median = {3} \n'.
          format(PERP, np.linalg.norm(P) / np.linalg.norm(Q), np.max(P), np.median(P)))
    P2Q.append(np.linalg.norm(P) / np.linalg.norm(Q))
    print('******************************************************** \n')
This data set contains 3023 samples

Dimensions of the  data set: 
(3023, 3) (3023,)


Working with Perplexity = 20
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.01579414642768075
Perplexity = 20, P / Q = 13.533318180666694, P_max = 8.236893624212539e-05, P_median = 0.0 

********************************************************
Working with Perplexity = 50
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.028473629763770728
Perplexity = 50, P / Q = 8.464993097923548, P_max = 3.360856310665985e-05, P_median = 5.869031028735738e-122 

********************************************************
Working with Perplexity = 100
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.04465423038847353
Perplexity = 100, P / Q = 5.957634061145452, P_max = 1.4208407895987522e-05, P_median = 1.2352300190021426e-50 

********************************************************
Working with Perplexity = 300
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.09404339224070538
Perplexity = 300, P / Q = 3.373563366977943, P_max = 5.817746175862234e-06, P_median = 2.128615572758489e-16 

********************************************************
Working with Perplexity = 500
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.1422154599769282
Perplexity = 500, P / Q = 2.583821927384243, P_max = 3.9722536120004726e-06, P_median = 1.7923053245148134e-11 

********************************************************
Working with Perplexity = 800
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.21545373561403589
Perplexity = 800, P / Q = 2.0347830243394944, P_max = 2.3571069709514343e-06, P_median = 2.1657060001562403e-09 

********************************************************
Working with Perplexity = 1000
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.26403213233455614
Perplexity = 1000, P / Q = 1.8304135132094073, P_max = 1.6685450469160157e-06, P_median = 8.42852745651713e-09 

********************************************************
Working with Perplexity = 1500
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.3539425152404159
Perplexity = 1500, P / Q = 1.5196832378806506, P_max = 7.651740798483235e-07, P_median = 3.548724505630782e-08 

********************************************************
Working with Perplexity = 2000
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.44061425423882394
Perplexity = 2000, P / Q = 1.3028764787669735, P_max = 4.831596818522594e-07, P_median = 6.288951682389261e-08 

********************************************************
Working with Perplexity = 2500
Sigma binary search finished 1000 of 3023 cells
Sigma binary search finished 2000 of 3023 cells
Sigma binary search finished 3000 of 3023 cells

Mean sigma = 0.5673800641008453
Perplexity = 2500, P / Q = 1.1250631615619726, P_max = 3.3010452756750126e-07, P_median = 8.665779602776437e-08 

********************************************************
In [204]:
plt.figure(figsize=(20,15))
plt.plot([20, 50, 100, 300, 500, 800, 1000, 1500, 2000, 2500], P2Q, '-o')
plt.title("tSNE: P / Q vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20)
plt.ylabel("P / Q", fontsize = 20)
plt.show()

Here we confirm that overall (norm) P becomes suspicously similar to overall (norm) Q at large perplexities, therefore their difference P-Q goes to zero and the KL-gradient disappears. It implies that the attractive forces in the KL-gradient, i.e. the P-term, becomes balannced by the repulsive forces, i.e. the Q-term in the KL-gradient, and the gradient descent never starts properly becomes when attraction is equal to repulsion, the data points do not move in space since the gradient is zero, i.e. effectively no forces act on them. The next two plots below demonstrate what is going on with the distribution of the high-dimensional probabilities when perplexity grows, for this purpose we monitor the maximum value of the high-dimensional probabilities in order to see the spread of their values, and the median of the high-dimensional probabilities in order to see how the overall distribution moves when perplexity grows.

In [205]:
plt.figure(figsize=(20,15))
plt.plot([20, 50, 100, 300, 500, 800, 1000, 1500, 2000, 2500], Pmax, '-o')
plt.title("tSNE: MAX(P) vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20)
plt.ylabel("MAX(P)", fontsize = 20)
plt.show()
In [206]:
plt.figure(figsize=(20,15))
plt.plot([20, 50, 100, 300, 500, 800, 1000, 1500, 2000, 2500], Pmedian, '-o')
plt.title("tSNE: MEDIAN(P) vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20)
plt.ylabel("MEDIAN(P)", fontsize = 20)
plt.show()

Here we see an interesting thing. The maximum values of the high-dimensional probabilities decreases and the median value increases when perplexity grows. This implies that at low perplexities, there were a few pairs of points with very high probabilities to observe their distances, while the majority of points had almost a zero probability to obsreve them at certain distance. With the increase of perplexity, the high-dimensional probabilities for different data points become more and more similar to each other, i.e. they all increase while there no more large values of the probabilities, but they all are more or less small although not zero.

Now let us understand what exactly it means that P becomes close to Q. We know that the Q-probability has a Student t-distribution shape with the heaver tails at large distances than those of the Gaussian distribution. However, as we can see below, when $\sigma$ increase, or perplexity grows, the tails of the two distributions become comparable, so the effect of the heavy tails of the Student t-distribution disappears at large perplexity.

In [223]:
sigma = 2.8
z = np.linspace(0., 10, 1000)
gauss = np.exp(-z**2 / sigma**2)
cauchy = 1/(1+z**2)

plt.figure(figsize=(20,15))
plt.plot(z, gauss, label='Gaussian distribution')
plt.plot(z, cauchy, label='Cauchy distribution')
plt.legend()
plt.show()

Can we help P to be larger then Q at the initialization stage? Yes, if you remember, there is the so-called early exaggeration parameter $\alpha$ that is just a factor that is $\alpha P$ instead of just P is used at the initialization stage in order to avoid the problem of disappearing gradient. Hence the condition when the tSNE algorithm is not properly initialized, i.e. when the KL-gradient becomes close to zero is:

$$\frac{\alpha}{N} e^{-\displaystyle\frac{X^2}{2\sigma^2}} = \frac{1}{1+Y^2}$$

,

where $\alpha$ is the early exaggeration parameter. Here we simply multiply the high-dimensional probability by $\alpha$ and divide it by the number of samples, remember the symmetry condition of the high-dimensional probabilities from the original tSNE algorithm. Let us first check that increasing early exaggeration can help to properly initialize the tSNE. As we remember, it failed previously with init = PCA, perplexity = 2000 and learning_rate = 200, the default perplexity was 12, let us increase it to 24 and then to 500.

In [10]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
y_train = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 2000, 
             n_iter = 1000, verbose = 2, early_exaggeration = 500, method = 'exact', init = 'random')
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y_train, s = 50)
plt.title('tSNE', fontsize = 20); plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 1000 / 3023
[t-SNE] Computed conditional probabilities for sample 2000 / 3023
[t-SNE] Computed conditional probabilities for sample 3000 / 3023
[t-SNE] Computed conditional probabilities for sample 3023 / 3023
[t-SNE] Mean sigma: 0.615106
[t-SNE] Iteration 50: error = 3856.7940872, gradient norm = 3.2316799 (50 iterations in 10.398s)
[t-SNE] Iteration 100: error = 3762.6112839, gradient norm = 3.9773443 (50 iterations in 10.380s)
[t-SNE] Iteration 150: error = 3796.7606694, gradient norm = 3.6208150 (50 iterations in 10.366s)
[t-SNE] Iteration 200: error = 3860.1249547, gradient norm = 3.5382659 (50 iterations in 10.386s)
[t-SNE] Iteration 250: error = 3843.3508896, gradient norm = 4.0546551 (50 iterations in 10.413s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 3843.350890
[t-SNE] Iteration 300: error = 0.9186637, gradient norm = 0.0024350 (50 iterations in 10.436s)
[t-SNE] Iteration 350: error = 0.1149457, gradient norm = 0.0022851 (50 iterations in 10.495s)
[t-SNE] Iteration 400: error = 0.0259746, gradient norm = 0.0001185 (50 iterations in 10.441s)
[t-SNE] Iteration 450: error = 0.0259330, gradient norm = 0.0000021 (50 iterations in 10.715s)
[t-SNE] Iteration 500: error = 0.0259330, gradient norm = 0.0000000 (50 iterations in 11.631s)
[t-SNE] Iteration 500: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 500 iterations: 0.025933

It does not look like increasing the early exaggeration parameter helps to better initialize the tSNE algorithm. An initeresting observation here is that since we embedded the 2D World Map into 3D Swiss Roll, the intrinsic dimension of the manifold is still 2D and the distances between the data points on the Swiss Roll seem to be preserved, i.e. the distributions of pairwise distances between the points in 3D and 2D are very similar.

In [235]:
dist_high_dim = np.square(euclidean_distances(X_swiss_roll, X_swiss_roll))
plt.figure(figsize = (20,15))
sns.distplot(dist_high_dim.reshape(-1,1))
plt.title("HIGH-DIMENSIONAL EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
In [234]:
dist_low_dim = np.square(euclidean_distances(X_reduced, X_reduced))
plt.figure(figsize = (20,15))
sns.distplot(dist_low_dim.reshape(-1,1))
plt.title("LOW-DIMENSIONAL EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
In [236]:
np.median(dist_high_dim.reshape(-1,1))
Out[236]:
0.5982135076148782
In [237]:
np.median(dist_low_dim.reshape(-1,1))
Out[237]:
0.5842699351000131

Let us now plot how P /Q changes as perplexity grows for different early exaggeration parameter values. We aim here to suggest a simple way to estimate whether the tSNE algorithm is initialized properly depending on the perplexity and early exaggeration parameters.

In [12]:
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
from sklearn.manifold._t_sne import _joint_probabilities
from sklearn.metrics.pairwise import euclidean_distances

X_train = X_swiss_roll

X_reduced = PCA(n_components = 2).fit_transform(X_train)
dist = np.square(euclidean_distances(X_train, X_train))

degrees_of_freedom = 1
PERP_START = 10; PERP_STEP = 20
MACHINE_EPSILON = np.finfo(np.double).eps

P2Q_list = []
for EARLY_EXAGGERATION in [1, 4, 12, 24]:
    print('Working with early_exaggeration = {}'.format(EARLY_EXAGGERATION))
    P2Q = []
    for PERP in range(PERP_START, X_train.shape[0], PERP_STEP):
        P = _joint_probabilities(distances = dist, desired_perplexity = PERP, verbose = 0)
        P = EARLY_EXAGGERATION * P
    
        X_embedded = X_reduced.reshape(X_reduced.shape[0], 2)
        inv_dist = pdist(X_embedded, "sqeuclidean")
        inv_dist /= degrees_of_freedom
        inv_dist += 1.
        inv_dist **= (degrees_of_freedom + 1.0) / -2.0
        Q = np.maximum(inv_dist / (2.0 * np.sum(inv_dist)), MACHINE_EPSILON)
    
        P2Q.append(np.linalg.norm(P) / np.linalg.norm(Q))
        print('Perplexity = {0}, P / Q = {1}'.format(PERP, np.linalg.norm(P) / np.linalg.norm(Q)))
    P2Q_list.append(P2Q)
    print('***********************************************************************\n')

plt.figure(figsize = (20, 15))
for i in range(len([1, 4, 12, 24])):
    plt.plot(range(PERP_START, X_train.shape[0], PERP_STEP), P2Q_list[i], '-o')

plt.hlines(1, PERP_START, X_train.shape[0], colors = 'red')
plt.gca().legend(('Early Exaggeration = 1', 'Early Exaggeration = 4', 
                  'Early Exaggeration = 12', 'Early Exaggeration = 24'), fontsize = 20)
plt.title("tSNE: P / Q at Different Perplexities and Early Exaggerations", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("P / Q", fontsize = 20)
plt.show()
Working with early_exaggeration = 1
Perplexity = 10, P / Q = 19.1282720951902
Perplexity = 30, P / Q = 10.979195812646758
Perplexity = 50, P / Q = 8.462302312217563
Perplexity = 70, P / Q = 7.1328866463640015
Perplexity = 90, P / Q = 6.282481816532622
Perplexity = 110, P / Q = 5.676144785501876
Perplexity = 130, P / Q = 5.213608817630593
Perplexity = 150, P / Q = 4.844816473092694
Perplexity = 170, P / Q = 4.541485450228636
Perplexity = 190, P / Q = 4.286227533592643
Perplexity = 210, P / Q = 4.0676952506329975
Perplexity = 230, P / Q = 3.8781268836912055
Perplexity = 250, P / Q = 3.711954527834804
Perplexity = 270, P / Q = 3.564961577059985
Perplexity = 290, P / Q = 3.433814514782828
Perplexity = 310, P / Q = 3.3158523792188057
Perplexity = 330, P / Q = 3.2089832646076695
Perplexity = 350, P / Q = 3.1115716338596746
Perplexity = 370, P / Q = 3.02232656654144
Perplexity = 390, P / Q = 2.9402105779924033
Perplexity = 410, P / Q = 2.864370167700402
Perplexity = 430, P / Q = 2.7940889218405203
Perplexity = 450, P / Q = 2.728758166764317
Perplexity = 470, P / Q = 2.667857197569975
Perplexity = 490, P / Q = 2.6109337788106357
Perplexity = 510, P / Q = 2.5575925802974298
Perplexity = 530, P / Q = 2.5074814316721685
Perplexity = 550, P / Q = 2.460289122036339
Perplexity = 570, P / Q = 2.4157367520033355
Perplexity = 590, P / Q = 2.3735782081579613
Perplexity = 610, P / Q = 2.3335971832605904
Perplexity = 630, P / Q = 2.2956041747692315
Perplexity = 650, P / Q = 2.2594347230343472
Perplexity = 670, P / Q = 2.2249488154910364
Perplexity = 690, P / Q = 2.192021794876077
Perplexity = 710, P / Q = 2.1605502896679383
Perplexity = 730, P / Q = 2.1304452286400255
Perplexity = 750, P / Q = 2.1016306041498516
Perplexity = 770, P / Q = 2.074041616871832
Perplexity = 790, P / Q = 2.047625058235911
Perplexity = 810, P / Q = 2.0223351457017826
Perplexity = 830, P / Q = 1.9981342119557814
Perplexity = 850, P / Q = 1.9749878791612325
Perplexity = 870, P / Q = 1.9528645183308888
Perplexity = 890, P / Q = 1.931725705625745
Perplexity = 910, P / Q = 1.9115258651675775
Perplexity = 930, P / Q = 1.892202177870157
Perplexity = 950, P / Q = 1.873681567243586
Perplexity = 970, P / Q = 1.8558894947979763
Perplexity = 990, P / Q = 1.8387539554188927
Perplexity = 1010, P / Q = 1.8222131821486887
Perplexity = 1030, P / Q = 1.806211707425288
Perplexity = 1050, P / Q = 1.7907022607131813
Perplexity = 1070, P / Q = 1.775644312437575
Perplexity = 1090, P / Q = 1.761002106843922
Perplexity = 1110, P / Q = 1.7467449795207732
Perplexity = 1130, P / Q = 1.7328453604682572
Perplexity = 1150, P / Q = 1.7192790233776694
Perplexity = 1170, P / Q = 1.7060236098508728
Perplexity = 1190, P / Q = 1.6930608528800524
Perplexity = 1210, P / Q = 1.6803717835518759
Perplexity = 1230, P / Q = 1.6679415821144263
Perplexity = 1250, P / Q = 1.6557552987208959
Perplexity = 1270, P / Q = 1.6437997353380733
Perplexity = 1290, P / Q = 1.6320631401680088
Perplexity = 1310, P / Q = 1.6205335634703972
Perplexity = 1330, P / Q = 1.6092019983220451
Perplexity = 1350, P / Q = 1.59805785835681
Perplexity = 1370, P / Q = 1.5870928528183785
Perplexity = 1390, P / Q = 1.576299148519869
Perplexity = 1410, P / Q = 1.5656687413993629
Perplexity = 1430, P / Q = 1.5551949374817864
Perplexity = 1450, P / Q = 1.5448707442324565
Perplexity = 1470, P / Q = 1.534690070691306
Perplexity = 1490, P / Q = 1.524647533676093
Perplexity = 1510, P / Q = 1.5147377083823648
Perplexity = 1530, P / Q = 1.5049547624073276
Perplexity = 1550, P / Q = 1.4952948918753324
Perplexity = 1570, P / Q = 1.4857530058766775
Perplexity = 1590, P / Q = 1.4763249906127829
Perplexity = 1610, P / Q = 1.4670072295729932
Perplexity = 1630, P / Q = 1.457795331234069
Perplexity = 1650, P / Q = 1.4486857103016217
Perplexity = 1670, P / Q = 1.4396752360233684
Perplexity = 1690, P / Q = 1.4307610318941084
Perplexity = 1710, P / Q = 1.4219391406932052
Perplexity = 1730, P / Q = 1.413207224703216
Perplexity = 1750, P / Q = 1.4045628144040256
Perplexity = 1770, P / Q = 1.3960024732916136
Perplexity = 1790, P / Q = 1.3875244631607184
Perplexity = 1810, P / Q = 1.3791254191680764
Perplexity = 1830, P / Q = 1.370803675482839
Perplexity = 1850, P / Q = 1.3625566401141551
Perplexity = 1870, P / Q = 1.354382883605736
Perplexity = 1890, P / Q = 1.3462798745548734
Perplexity = 1910, P / Q = 1.3382454097755068
Perplexity = 1930, P / Q = 1.3302782225035286
Perplexity = 1950, P / Q = 1.3223755477091748
Perplexity = 1970, P / Q = 1.3145369223680536
Perplexity = 1990, P / Q = 1.306759369887059
Perplexity = 2010, P / Q = 1.2990423865262222
Perplexity = 2030, P / Q = 1.2913835515968048
Perplexity = 2050, P / Q = 1.2837814570752875
Perplexity = 2070, P / Q = 1.2762349530563804
Perplexity = 2090, P / Q = 1.2687423054059679
Perplexity = 2110, P / Q = 1.2613025862752365
Perplexity = 2130, P / Q = 1.253913961833971
Perplexity = 2150, P / Q = 1.2465755259389562
Perplexity = 2170, P / Q = 1.2392855783809025
Perplexity = 2190, P / Q = 1.2320433413733394
Perplexity = 2210, P / Q = 1.2248468694713646
Perplexity = 2230, P / Q = 1.2176960679931297
Perplexity = 2250, P / Q = 1.2105891397696573
Perplexity = 2270, P / Q = 1.2035252823047329
Perplexity = 2290, P / Q = 1.1965031798038874
Perplexity = 2310, P / Q = 1.1895221176220068
Perplexity = 2330, P / Q = 1.182580998294008
Perplexity = 2350, P / Q = 1.1756782447919956
Perplexity = 2370, P / Q = 1.1688141012151283
Perplexity = 2390, P / Q = 1.1619867454226749
Perplexity = 2410, P / Q = 1.155195431670222
Perplexity = 2430, P / Q = 1.1484391865142414
Perplexity = 2450, P / Q = 1.1417175523974277
Perplexity = 2470, P / Q = 1.1350296364449173
Perplexity = 2490, P / Q = 1.1283748365805775
Perplexity = 2510, P / Q = 1.121751267418751
Perplexity = 2530, P / Q = 1.1151591754394057
Perplexity = 2550, P / Q = 1.1085977778011034
Perplexity = 2570, P / Q = 1.1020659737357348
Perplexity = 2590, P / Q = 1.0955633311183215
Perplexity = 2610, P / Q = 1.089089078255003
Perplexity = 2630, P / Q = 1.082642513185285
Perplexity = 2650, P / Q = 1.0762231450223994
Perplexity = 2670, P / Q = 1.069829727370455
Perplexity = 2690, P / Q = 1.063462547190929
Perplexity = 2710, P / Q = 1.057120298079619
Perplexity = 2730, P / Q = 1.050803105202597
Perplexity = 2750, P / Q = 1.0445097665454615
Perplexity = 2770, P / Q = 1.0382403583482267
Perplexity = 2790, P / Q = 1.0319936009204294
Perplexity = 2810, P / Q = 1.025769515021011
Perplexity = 2830, P / Q = 1.0195680962113922
Perplexity = 2850, P / Q = 1.0133879299912747
Perplexity = 2870, P / Q = 1.0072293939265042
Perplexity = 2890, P / Q = 1.0010917044841754
Perplexity = 2910, P / Q = 0.9949749604289573
Perplexity = 2930, P / Q = 0.9888790217503352
Perplexity = 2950, P / Q = 0.9828033861071531
Perplexity = 2970, P / Q = 0.9767487376934801
Perplexity = 2990, P / Q = 0.9707158048028935
Perplexity = 3010, P / Q = 0.9647063021996974
***********************************************************************

Working with early_exaggeration = 4
Perplexity = 10, P / Q = 76.5130883807608
Perplexity = 30, P / Q = 43.91678325058703
Perplexity = 50, P / Q = 33.84920924887025
Perplexity = 70, P / Q = 28.531546585456006
Perplexity = 90, P / Q = 25.129927266130487
Perplexity = 110, P / Q = 22.704579142007503
Perplexity = 130, P / Q = 20.854435270522373
Perplexity = 150, P / Q = 19.379265892370775
Perplexity = 170, P / Q = 18.165941800914545
Perplexity = 190, P / Q = 17.144910134370573
Perplexity = 210, P / Q = 16.27078100253199
Perplexity = 230, P / Q = 15.512507534764822
Perplexity = 250, P / Q = 14.847818111339215
Perplexity = 270, P / Q = 14.25984630823994
Perplexity = 290, P / Q = 13.735258059131311
Perplexity = 310, P / Q = 13.263409516875223
Perplexity = 330, P / Q = 12.835933058430678
Perplexity = 350, P / Q = 12.446286535438698
Perplexity = 370, P / Q = 12.08930626616576
Perplexity = 390, P / Q = 11.760842311969613
Perplexity = 410, P / Q = 11.457480670801608
Perplexity = 430, P / Q = 11.176355687362081
Perplexity = 450, P / Q = 10.915032667057268
Perplexity = 470, P / Q = 10.6714287902799
Perplexity = 490, P / Q = 10.443735115242543
Perplexity = 510, P / Q = 10.230370321189719
Perplexity = 530, P / Q = 10.029925726688674
Perplexity = 550, P / Q = 9.841156488145357
Perplexity = 570, P / Q = 9.662947008013342
Perplexity = 590, P / Q = 9.494312832631845
Perplexity = 610, P / Q = 9.334388733042362
Perplexity = 630, P / Q = 9.182416699076926
Perplexity = 650, P / Q = 9.037738892137389
Perplexity = 670, P / Q = 8.899795261964146
Perplexity = 690, P / Q = 8.768087179504308
Perplexity = 710, P / Q = 8.642201158671753
Perplexity = 730, P / Q = 8.521780914560102
Perplexity = 750, P / Q = 8.406522416599406
Perplexity = 770, P / Q = 8.296166467487328
Perplexity = 790, P / Q = 8.190500232943643
Perplexity = 810, P / Q = 8.08934058280713
Perplexity = 830, P / Q = 7.992536847823126
Perplexity = 850, P / Q = 7.89995151664493
Perplexity = 870, P / Q = 7.811458073323555
Perplexity = 890, P / Q = 7.72690282250298
Perplexity = 910, P / Q = 7.64610346067031
Perplexity = 930, P / Q = 7.568808711480628
Perplexity = 950, P / Q = 7.494726268974344
Perplexity = 970, P / Q = 7.423557979191905
Perplexity = 990, P / Q = 7.355015821675571
Perplexity = 1010, P / Q = 7.288852728594755
Perplexity = 1030, P / Q = 7.224846829701152
Perplexity = 1050, P / Q = 7.162809042852725
Perplexity = 1070, P / Q = 7.1025772497503
Perplexity = 1090, P / Q = 7.044008427375688
Perplexity = 1110, P / Q = 6.986979918083093
Perplexity = 1130, P / Q = 6.931381441873029
Perplexity = 1150, P / Q = 6.877116093510677
Perplexity = 1170, P / Q = 6.824094439403491
Perplexity = 1190, P / Q = 6.7722434115202095
Perplexity = 1210, P / Q = 6.7214871342075035
Perplexity = 1230, P / Q = 6.671766328457705
Perplexity = 1250, P / Q = 6.6230211948835835
Perplexity = 1270, P / Q = 6.575198941352293
Perplexity = 1290, P / Q = 6.528252560672035
Perplexity = 1310, P / Q = 6.482134253881589
Perplexity = 1330, P / Q = 6.4368079932881805
Perplexity = 1350, P / Q = 6.39223143342724
Perplexity = 1370, P / Q = 6.348371411273514
Perplexity = 1390, P / Q = 6.305196594079476
Perplexity = 1410, P / Q = 6.262674965597451
Perplexity = 1430, P / Q = 6.2207797499271456
Perplexity = 1450, P / Q = 6.179482976929826
Perplexity = 1470, P / Q = 6.138760282765224
Perplexity = 1490, P / Q = 6.098590134704372
Perplexity = 1510, P / Q = 6.058950833529459
Perplexity = 1530, P / Q = 6.0198190496293105
Perplexity = 1550, P / Q = 5.9811795675013295
Perplexity = 1570, P / Q = 5.94301202350671
Perplexity = 1590, P / Q = 5.9052999624511315
Perplexity = 1610, P / Q = 5.868028918291973
Perplexity = 1630, P / Q = 5.831181324936276
Perplexity = 1650, P / Q = 5.794742841206487
Perplexity = 1670, P / Q = 5.758700944093474
Perplexity = 1690, P / Q = 5.723044127576434
Perplexity = 1710, P / Q = 5.687756562772821
Perplexity = 1730, P / Q = 5.652828898812864
Perplexity = 1750, P / Q = 5.618251257616103
Perplexity = 1770, P / Q = 5.584009893166455
Perplexity = 1790, P / Q = 5.550097852642874
Perplexity = 1810, P / Q = 5.516501676672306
Perplexity = 1830, P / Q = 5.483214701931356
Perplexity = 1850, P / Q = 5.4502265604566205
Perplexity = 1870, P / Q = 5.417531534422944
Perplexity = 1890, P / Q = 5.385119498219494
Perplexity = 1910, P / Q = 5.352981639102027
Perplexity = 1930, P / Q = 5.321112890014114
Perplexity = 1950, P / Q = 5.289502190836699
Perplexity = 1970, P / Q = 5.258147689472215
Perplexity = 1990, P / Q = 5.227037479548236
Perplexity = 2010, P / Q = 5.196169546104889
Perplexity = 2030, P / Q = 5.165534206387219
Perplexity = 2050, P / Q = 5.13512582830115
Perplexity = 2070, P / Q = 5.104939812225521
Perplexity = 2090, P / Q = 5.0749692216238715
Perplexity = 2110, P / Q = 5.045210345100946
Perplexity = 2130, P / Q = 5.015655847335884
Perplexity = 2150, P / Q = 4.986302103755825
Perplexity = 2170, P / Q = 4.95714231352361
Perplexity = 2190, P / Q = 4.9281733654933575
Perplexity = 2210, P / Q = 4.899387477885458
Perplexity = 2230, P / Q = 4.870784271972519
Perplexity = 2250, P / Q = 4.842356559078629
Perplexity = 2270, P / Q = 4.8141011292189315
Perplexity = 2290, P / Q = 4.78601271921555
Perplexity = 2310, P / Q = 4.758088470488027
Perplexity = 2330, P / Q = 4.730323993176032
Perplexity = 2350, P / Q = 4.702712979167982
Perplexity = 2370, P / Q = 4.675256404860513
Perplexity = 2390, P / Q = 4.6479469816906995
Perplexity = 2410, P / Q = 4.620781726680888
Perplexity = 2430, P / Q = 4.5937567460569655
Perplexity = 2450, P / Q = 4.566870209589711
Perplexity = 2470, P / Q = 4.540118545779669
Perplexity = 2490, P / Q = 4.51349934632231
Perplexity = 2510, P / Q = 4.487005069675004
Perplexity = 2530, P / Q = 4.460636701757623
Perplexity = 2550, P / Q = 4.434391111204413
Perplexity = 2570, P / Q = 4.408263894942939
Perplexity = 2590, P / Q = 4.382253324473286
Perplexity = 2610, P / Q = 4.356356313020012
Perplexity = 2630, P / Q = 4.33057005274114
Perplexity = 2650, P / Q = 4.304892580089597
Perplexity = 2670, P / Q = 4.27931890948182
Perplexity = 2690, P / Q = 4.253850188763716
Perplexity = 2710, P / Q = 4.228481192318476
Perplexity = 2730, P / Q = 4.203212420810388
Perplexity = 2750, P / Q = 4.178039066181846
Perplexity = 2770, P / Q = 4.152961433392907
Perplexity = 2790, P / Q = 4.127974403681717
Perplexity = 2810, P / Q = 4.103078060084044
Perplexity = 2830, P / Q = 4.078272384845569
Perplexity = 2850, P / Q = 4.053551719965099
Perplexity = 2870, P / Q = 4.028917575706017
Perplexity = 2890, P / Q = 4.004366817936702
Perplexity = 2910, P / Q = 3.979899841715829
Perplexity = 2930, P / Q = 3.9555160870013406
Perplexity = 2950, P / Q = 3.9312135444286125
Perplexity = 2970, P / Q = 3.9069949507739206
Perplexity = 2990, P / Q = 3.882863219211574
Perplexity = 3010, P / Q = 3.8588252087987898
***********************************************************************

Working with early_exaggeration = 12
Perplexity = 10, P / Q = 229.53926514228243
Perplexity = 30, P / Q = 131.75034975176106
Perplexity = 50, P / Q = 101.54762774661076
Perplexity = 70, P / Q = 85.59463975636803
Perplexity = 90, P / Q = 75.38978179839145
Perplexity = 110, P / Q = 68.11373742602248
Perplexity = 130, P / Q = 62.563305811567126
Perplexity = 150, P / Q = 58.13779767711235
Perplexity = 170, P / Q = 54.497825402743615
Perplexity = 190, P / Q = 51.43473040311172
Perplexity = 210, P / Q = 48.81234300759594
Perplexity = 230, P / Q = 46.53752260429446
Perplexity = 250, P / Q = 44.54345433401767
Perplexity = 270, P / Q = 42.77953892471984
Perplexity = 290, P / Q = 41.20577417739391
Perplexity = 310, P / Q = 39.79022855062569
Perplexity = 330, P / Q = 38.50779917529201
Perplexity = 350, P / Q = 37.3388596063161
Perplexity = 370, P / Q = 36.26791879849727
Perplexity = 390, P / Q = 35.282526935908834
Perplexity = 410, P / Q = 34.37244201240483
Perplexity = 430, P / Q = 33.52906706208625
Perplexity = 450, P / Q = 32.74509800117179
Perplexity = 470, P / Q = 32.01428637083972
Perplexity = 490, P / Q = 31.33120534572762
Perplexity = 510, P / Q = 30.691110963569184
Perplexity = 530, P / Q = 30.089777180066047
Perplexity = 550, P / Q = 29.523469464436094
Perplexity = 570, P / Q = 28.988841024040035
Perplexity = 590, P / Q = 28.482938497895542
Perplexity = 610, P / Q = 28.003166199127104
Perplexity = 630, P / Q = 27.547250097230762
Perplexity = 650, P / Q = 27.11321667641217
Perplexity = 670, P / Q = 26.699385785892446
Perplexity = 690, P / Q = 26.304261538512947
Perplexity = 710, P / Q = 25.92660347601527
Perplexity = 730, P / Q = 25.565342743680354
Perplexity = 750, P / Q = 25.219567249798192
Perplexity = 770, P / Q = 24.88849940246197
Perplexity = 790, P / Q = 24.57150069883091
Perplexity = 810, P / Q = 24.26802174842142
Perplexity = 830, P / Q = 23.977610543469385
Perplexity = 850, P / Q = 23.69985454993479
Perplexity = 870, P / Q = 23.43437421997071
Perplexity = 890, P / Q = 23.180708467508953
Perplexity = 910, P / Q = 22.93831038201094
Perplexity = 930, P / Q = 22.706426134441863
Perplexity = 950, P / Q = 22.48417880692302
Perplexity = 970, P / Q = 22.27067393757568
Perplexity = 990, P / Q = 22.065047465026716
Perplexity = 1010, P / Q = 21.86655818578426
Perplexity = 1030, P / Q = 21.674540489103446
Perplexity = 1050, P / Q = 21.488427128558172
Perplexity = 1070, P / Q = 21.307731749250895
Perplexity = 1090, P / Q = 21.13202528212707
Perplexity = 1110, P / Q = 20.9609397542493
Perplexity = 1130, P / Q = 20.794144325619122
Perplexity = 1150, P / Q = 20.631348280532016
Perplexity = 1170, P / Q = 20.472283318210476
Perplexity = 1190, P / Q = 20.316730234560637
Perplexity = 1210, P / Q = 20.164461402622486
Perplexity = 1230, P / Q = 20.01529898537313
Perplexity = 1250, P / Q = 19.86906358465075
Perplexity = 1270, P / Q = 19.725596824056883
Perplexity = 1290, P / Q = 19.58475768201611
Perplexity = 1310, P / Q = 19.446402761644766
Perplexity = 1330, P / Q = 19.310423979864513
Perplexity = 1350, P / Q = 19.176694300281714
Perplexity = 1370, P / Q = 19.045114233820538
Perplexity = 1390, P / Q = 18.915589782238445
Perplexity = 1410, P / Q = 18.788024896792347
Perplexity = 1430, P / Q = 18.662339249781432
Perplexity = 1450, P / Q = 18.538448930789478
Perplexity = 1470, P / Q = 18.416280848295663
Perplexity = 1490, P / Q = 18.295770404113117
Perplexity = 1510, P / Q = 18.176852500588357
Perplexity = 1530, P / Q = 18.05945714888792
Perplexity = 1550, P / Q = 17.94353870250399
Perplexity = 1570, P / Q = 17.829036070520118
Perplexity = 1590, P / Q = 17.715899887353384
Perplexity = 1610, P / Q = 17.604086754875915
Perplexity = 1630, P / Q = 17.493543974808823
Perplexity = 1650, P / Q = 17.38422852361946
Perplexity = 1670, P / Q = 17.27610283228043
Perplexity = 1690, P / Q = 17.16913238272931
Perplexity = 1710, P / Q = 17.063269688318467
Perplexity = 1730, P / Q = 16.958486696438598
Perplexity = 1750, P / Q = 16.854753772848305
Perplexity = 1770, P / Q = 16.752029679499355
Perplexity = 1790, P / Q = 16.650293557928613
Perplexity = 1810, P / Q = 16.549505030016928
Perplexity = 1830, P / Q = 16.44964410579406
Perplexity = 1850, P / Q = 16.35067968136986
Perplexity = 1870, P / Q = 16.252594603268825
Perplexity = 1890, P / Q = 16.15535849465849
Perplexity = 1910, P / Q = 16.05894491730608
Perplexity = 1930, P / Q = 15.96333867004232
Perplexity = 1950, P / Q = 15.868506572510086
Perplexity = 1970, P / Q = 15.77444306841663
Perplexity = 1990, P / Q = 15.681112438644716
Perplexity = 2010, P / Q = 15.58850863831467
Perplexity = 2030, P / Q = 15.496602619161665
Perplexity = 2050, P / Q = 15.405377484903441
Perplexity = 2070, P / Q = 15.314819436676547
Perplexity = 2090, P / Q = 15.22490766487162
Perplexity = 2110, P / Q = 15.135631035302843
Perplexity = 2130, P / Q = 15.04696754200766
Perplexity = 2150, P / Q = 14.958906311267473
Perplexity = 2170, P / Q = 14.871426940570839
Perplexity = 2190, P / Q = 14.784520096480074
Perplexity = 2210, P / Q = 14.698162433656371
Perplexity = 2230, P / Q = 14.61235281591756
Perplexity = 2250, P / Q = 14.527069677235906
Perplexity = 2270, P / Q = 14.442303387656782
Perplexity = 2290, P / Q = 14.358038157646636
Perplexity = 2310, P / Q = 14.274265411464086
Perplexity = 2330, P / Q = 14.190971979528099
Perplexity = 2350, P / Q = 14.108138937503945
Perplexity = 2370, P / Q = 14.025769214581533
Perplexity = 2390, P / Q = 13.943840945072099
Perplexity = 2410, P / Q = 13.862345180042668
Perplexity = 2430, P / Q = 13.781270238170892
Perplexity = 2450, P / Q = 13.700610628769134
Perplexity = 2470, P / Q = 13.620355637339001
Perplexity = 2490, P / Q = 13.540498038966923
Perplexity = 2510, P / Q = 13.461015209025
Perplexity = 2530, P / Q = 13.381910105272869
Perplexity = 2550, P / Q = 13.303173333613254
Perplexity = 2570, P / Q = 13.224791684828821
Perplexity = 2590, P / Q = 13.146759973419847
Perplexity = 2610, P / Q = 13.069068939060024
Perplexity = 2630, P / Q = 12.991710158223407
Perplexity = 2650, P / Q = 12.9146777402688
Perplexity = 2670, P / Q = 12.837956728445459
Perplexity = 2690, P / Q = 12.761550566291154
Perplexity = 2710, P / Q = 12.685443576955436
Perplexity = 2730, P / Q = 12.609637262431178
Perplexity = 2750, P / Q = 12.534117198545532
Perplexity = 2770, P / Q = 12.458884300178745
Perplexity = 2790, P / Q = 12.383923211045158
Perplexity = 2810, P / Q = 12.309234180252139
Perplexity = 2830, P / Q = 12.234817154536696
Perplexity = 2850, P / Q = 12.160655159895299
Perplexity = 2870, P / Q = 12.086752727118043
Perplexity = 2890, P / Q = 12.013100453810091
Perplexity = 2910, P / Q = 11.939699525147475
Perplexity = 2930, P / Q = 11.866548261004015
Perplexity = 2950, P / Q = 11.793640633285843
Perplexity = 2970, P / Q = 11.72098485232176
Perplexity = 2990, P / Q = 11.648589657634734
Perplexity = 3010, P / Q = 11.576475626396372
***********************************************************************

Working with early_exaggeration = 24
Perplexity = 10, P / Q = 459.07853028456486
Perplexity = 30, P / Q = 263.5006995035221
Perplexity = 50, P / Q = 203.09525549322152
Perplexity = 70, P / Q = 171.18927951273605
Perplexity = 90, P / Q = 150.7795635967829
Perplexity = 110, P / Q = 136.22747485204496
Perplexity = 130, P / Q = 125.12661162313425
Perplexity = 150, P / Q = 116.2755953542247
Perplexity = 170, P / Q = 108.99565080548723
Perplexity = 190, P / Q = 102.86946080622344
Perplexity = 210, P / Q = 97.62468601519188
Perplexity = 230, P / Q = 93.07504520858892
Perplexity = 250, P / Q = 89.08690866803533
Perplexity = 270, P / Q = 85.55907784943967
Perplexity = 290, P / Q = 82.41154835478783
Perplexity = 310, P / Q = 79.58045710125138
Perplexity = 330, P / Q = 77.01559835058401
Perplexity = 350, P / Q = 74.6777192126322
Perplexity = 370, P / Q = 72.53583759699454
Perplexity = 390, P / Q = 70.56505387181767
Perplexity = 410, P / Q = 68.74488402480966
Perplexity = 430, P / Q = 67.0581341241725
Perplexity = 450, P / Q = 65.49019600234358
Perplexity = 470, P / Q = 64.02857274167944
Perplexity = 490, P / Q = 62.66241069145524
Perplexity = 510, P / Q = 61.38222192713837
Perplexity = 530, P / Q = 60.179554360132094
Perplexity = 550, P / Q = 59.04693892887219
Perplexity = 570, P / Q = 57.97768204808007
Perplexity = 590, P / Q = 56.965876995791085
Perplexity = 610, P / Q = 56.00633239825421
Perplexity = 630, P / Q = 55.094500194461524
Perplexity = 650, P / Q = 54.22643335282434
Perplexity = 670, P / Q = 53.39877157178489
Perplexity = 690, P / Q = 52.60852307702589
Perplexity = 710, P / Q = 51.85320695203054
Perplexity = 730, P / Q = 51.13068548736071
Perplexity = 750, P / Q = 50.439134499596385
Perplexity = 770, P / Q = 49.77699880492394
Perplexity = 790, P / Q = 49.14300139766182
Perplexity = 810, P / Q = 48.53604349684284
Perplexity = 830, P / Q = 47.95522108693877
Perplexity = 850, P / Q = 47.39970909986958
Perplexity = 870, P / Q = 46.86874843994142
Perplexity = 890, P / Q = 46.36141693501791
Perplexity = 910, P / Q = 45.87662076402188
Perplexity = 930, P / Q = 45.412852268883725
Perplexity = 950, P / Q = 44.96835761384604
Perplexity = 970, P / Q = 44.54134787515136
Perplexity = 990, P / Q = 44.13009493005343
Perplexity = 1010, P / Q = 43.73311637156852
Perplexity = 1030, P / Q = 43.34908097820689
Perplexity = 1050, P / Q = 42.976854257116344
Perplexity = 1070, P / Q = 42.61546349850179
Perplexity = 1090, P / Q = 42.26405056425414
Perplexity = 1110, P / Q = 41.9218795084986
Perplexity = 1130, P / Q = 41.588288651238244
Perplexity = 1150, P / Q = 41.26269656106403
Perplexity = 1170, P / Q = 40.94456663642095
Perplexity = 1190, P / Q = 40.633460469121275
Perplexity = 1210, P / Q = 40.32892280524497
Perplexity = 1230, P / Q = 40.03059797074626
Perplexity = 1250, P / Q = 39.7381271693015
Perplexity = 1270, P / Q = 39.451193648113765
Perplexity = 1290, P / Q = 39.16951536403222
Perplexity = 1310, P / Q = 38.89280552328953
Perplexity = 1330, P / Q = 38.620847959729026
Perplexity = 1350, P / Q = 38.35338860056343
Perplexity = 1370, P / Q = 38.090228467641076
Perplexity = 1390, P / Q = 37.83117956447689
Perplexity = 1410, P / Q = 37.576049793584694
Perplexity = 1430, P / Q = 37.324678499562864
Perplexity = 1450, P / Q = 37.076897861578956
Perplexity = 1470, P / Q = 36.83256169659133
Perplexity = 1490, P / Q = 36.591540808226235
Perplexity = 1510, P / Q = 36.35370500117671
Perplexity = 1530, P / Q = 36.11891429777584
Perplexity = 1550, P / Q = 35.88707740500798
Perplexity = 1570, P / Q = 35.658072141040236
Perplexity = 1590, P / Q = 35.43179977470677
Perplexity = 1610, P / Q = 35.20817350975183
Perplexity = 1630, P / Q = 34.987087949617646
Perplexity = 1650, P / Q = 34.76845704723892
Perplexity = 1670, P / Q = 34.55220566456086
Perplexity = 1690, P / Q = 34.33826476545862
Perplexity = 1710, P / Q = 34.126539376636934
Perplexity = 1730, P / Q = 33.916973392877196
Perplexity = 1750, P / Q = 33.70950754569661
Perplexity = 1770, P / Q = 33.50405935899871
Perplexity = 1790, P / Q = 33.300587115857226
Perplexity = 1810, P / Q = 33.099010060033855
Perplexity = 1830, P / Q = 32.89928821158812
Perplexity = 1850, P / Q = 32.70135936273972
Perplexity = 1870, P / Q = 32.50518920653765
Perplexity = 1890, P / Q = 32.31071698931698
Perplexity = 1910, P / Q = 32.11788983461216
Perplexity = 1930, P / Q = 31.92667734008464
Perplexity = 1950, P / Q = 31.73701314502017
Perplexity = 1970, P / Q = 31.54888613683326
Perplexity = 1990, P / Q = 31.36222487728943
Perplexity = 2010, P / Q = 31.17701727662934
Perplexity = 2030, P / Q = 30.99320523832333
Perplexity = 2050, P / Q = 30.810754969806883
Perplexity = 2070, P / Q = 30.629638873353095
Perplexity = 2090, P / Q = 30.44981532974324
Perplexity = 2110, P / Q = 30.271262070605687
Perplexity = 2130, P / Q = 30.09393508401532
Perplexity = 2150, P / Q = 29.917812622534946
Perplexity = 2170, P / Q = 29.742853881141677
Perplexity = 2190, P / Q = 29.56904019296015
Perplexity = 2210, P / Q = 29.396324867312742
Perplexity = 2230, P / Q = 29.22470563183512
Perplexity = 2250, P / Q = 29.05413935447181
Perplexity = 2270, P / Q = 28.884606775313564
Perplexity = 2290, P / Q = 28.71607631529327
Perplexity = 2310, P / Q = 28.54853082292817
Perplexity = 2330, P / Q = 28.381943959056198
Perplexity = 2350, P / Q = 28.21627787500789
Perplexity = 2370, P / Q = 28.051538429163067
Perplexity = 2390, P / Q = 27.887681890144197
Perplexity = 2410, P / Q = 27.724690360085336
Perplexity = 2430, P / Q = 27.562540476341784
Perplexity = 2450, P / Q = 27.401221257538268
Perplexity = 2470, P / Q = 27.240711274678002
Perplexity = 2490, P / Q = 27.080996077933847
Perplexity = 2510, P / Q = 26.92203041805
Perplexity = 2530, P / Q = 26.763820210545738
Perplexity = 2550, P / Q = 26.606346667226507
Perplexity = 2570, P / Q = 26.449583369657642
Perplexity = 2590, P / Q = 26.293519946839695
Perplexity = 2610, P / Q = 26.138137878120048
Perplexity = 2630, P / Q = 25.983420316446814
Perplexity = 2650, P / Q = 25.8293554805376
Perplexity = 2670, P / Q = 25.675913456890918
Perplexity = 2690, P / Q = 25.52310113258231
Perplexity = 2710, P / Q = 25.37088715391087
Perplexity = 2730, P / Q = 25.219274524862357
Perplexity = 2750, P / Q = 25.068234397091064
Perplexity = 2770, P / Q = 24.91776860035749
Perplexity = 2790, P / Q = 24.767846422090315
Perplexity = 2810, P / Q = 24.618468360504277
Perplexity = 2830, P / Q = 24.469634309073392
Perplexity = 2850, P / Q = 24.321310319790598
Perplexity = 2870, P / Q = 24.173505454236086
Perplexity = 2890, P / Q = 24.026200907620183
Perplexity = 2910, P / Q = 23.87939905029495
Perplexity = 2930, P / Q = 23.73309652200803
Perplexity = 2950, P / Q = 23.587281266571686
Perplexity = 2970, P / Q = 23.44196970464352
Perplexity = 2990, P / Q = 23.297179315269467
Perplexity = 3010, P / Q = 23.152951252792743
***********************************************************************

We can see that increasing the early exaggeration can in principle prevent the P become close to Q. However, as we saw above this still did not help to properly initialize the tSNE algorithm. We need to further understand why the KL-gradient still disappears even at large early exaggerations. Let uu implement a similar procedure as above but this time for the KL-gradient itself and not the P / Q ratio. It is possible that the other two factors in the expression of KL-gradient, i.e. not only the P-Q difference, contribute to the negligible contribution from the KL-gradinet compared to the initialization with PCA.

In [17]:
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.manifold._t_sne import _joint_probabilities, _kl_divergence

X_train = X_swiss_roll; n = X_train.shape[0]
X_reduced = PCA(n_components = 2).fit_transform(X_train)
dist = np.square(euclidean_distances(X_train, X_train))

PERP_START = 10; PERP_STEP = 250; N_LOW_DIMS = 2
KL_grad_list = []
for EARLY_EXAGGERATION in [1, 4, 12, 24]:
    print('Working with early_exaggeration = {}'.format(EARLY_EXAGGERATION))
    KL_grad = []
    for PERP in range(PERP_START, X_train.shape[0], PERP_STEP):
        P = _joint_probabilities(distances = dist, desired_perplexity = PERP, verbose = 0)
        P = EARLY_EXAGGERATION * P
    
        kl, grad = _kl_divergence(params = X_reduced, P = P, n_samples = n, n_components = 2, 
                                  degrees_of_freedom = 1)
        KL_grad_norm = np.linalg.norm(grad)
        KL_grad.append(KL_grad_norm)
        print('Perplexity = {0}, KL_grad = {1}'.format(PERP, KL_grad_norm))
    KL_grad_list.append(KL_grad)
    print('***********************************************************************\n')

plt.figure(figsize = (20, 15))
for i in range(len([1, 4, 12, 24])):
    plt.plot(range(PERP_START, X_train.shape[0], PERP_STEP), KL_grad_list[i], '-o')

plt.hlines(0, PERP_START, X_train.shape[0], colors = 'red')
plt.gca().legend(('Early Exaggeration = 1', 'Early Exaggeration = 4', 
                  'Early Exaggeration = 12', 'Early Exaggeration = 24'), fontsize = 20)
plt.title("tSNE: KL-GRADIENT at Different Perplexities and Early Exaggerations", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("KL-GRADIENT", fontsize = 20)
plt.show()
Working with early_exaggeration = 1
Perplexity = 10, KL_grad = 0.01790444740327198
Perplexity = 260, KL_grad = 0.0174541936137467
Perplexity = 510, KL_grad = 0.01666943138811564
Perplexity = 760, KL_grad = 0.015600747064296884
Perplexity = 1010, KL_grad = 0.014283022596581537
Perplexity = 1260, KL_grad = 0.012741957397629618
Perplexity = 1510, KL_grad = 0.011130845419151053
Perplexity = 1760, KL_grad = 0.009483406482585357
Perplexity = 2010, KL_grad = 0.007787890825052564
Perplexity = 2260, KL_grad = 0.006006183899333451
Perplexity = 2510, KL_grad = 0.004060802624239353
Perplexity = 2760, KL_grad = 0.0017859059069286693
Perplexity = 3010, KL_grad = 0.0026544201932151117
***********************************************************************

Working with early_exaggeration = 4
Perplexity = 10, KL_grad = 0.017892832648578133
Perplexity = 260, KL_grad = 0.016872538976059684
Perplexity = 510, KL_grad = 0.015336202984074593
Perplexity = 760, KL_grad = 0.014405223930178148
Perplexity = 1010, KL_grad = 0.014319527852624416
Perplexity = 1260, KL_grad = 0.014215647712196433
Perplexity = 1510, KL_grad = 0.016087562423158765
Perplexity = 1760, KL_grad = 0.019932251545874278
Perplexity = 2010, KL_grad = 0.025106817486039927
Perplexity = 2260, KL_grad = 0.03125570377335557
Perplexity = 2510, KL_grad = 0.038435168363966335
Perplexity = 2760, KL_grad = 0.04734191641505803
Perplexity = 3010, KL_grad = 0.0637188204885116
***********************************************************************

Working with early_exaggeration = 12
Perplexity = 10, KL_grad = 0.017951182514457733
Perplexity = 260, KL_grad = 0.021012300340380374
Perplexity = 510, KL_grad = 0.027461329932529633
Perplexity = 760, KL_grad = 0.03940966663291045
Perplexity = 1010, KL_grad = 0.05278407546814124
Perplexity = 1260, KL_grad = 0.06392389382711235
Perplexity = 1510, KL_grad = 0.07707798485802764
Perplexity = 1760, KL_grad = 0.09247212383106654
Perplexity = 2010, KL_grad = 0.10978072601628519
Perplexity = 2260, KL_grad = 0.12903683904493266
Perplexity = 2510, KL_grad = 0.15093366931881053
Perplexity = 2760, KL_grad = 0.17779352435055398
Perplexity = 3010, KL_grad = 0.22692349175137477
***********************************************************************

Working with early_exaggeration = 24
Perplexity = 10, KL_grad = 0.01827867875461508
Perplexity = 260, KL_grad = 0.03535119736121583
Perplexity = 510, KL_grad = 0.05707974755537516
Perplexity = 760, KL_grad = 0.0866157104841047
Perplexity = 1010, KL_grad = 0.11700264819319746
Perplexity = 1260, KL_grad = 0.1421858843964194
Perplexity = 1510, KL_grad = 0.17023530255733732
Perplexity = 1760, KL_grad = 0.2019506945626486
Perplexity = 2010, KL_grad = 0.2370445842773153
Perplexity = 2260, KL_grad = 0.2757979776187948
Perplexity = 2510, KL_grad = 0.31970888263557845
Perplexity = 2760, KL_grad = 0.37347776309532793
Perplexity = 3010, KL_grad = 0.4717359740930772
***********************************************************************

Let us now check that tSNE degrades to PCA for the scRNAseq CAFs data set:

In [1]:
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import euclidean_distances

path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)
                1110020A21Rik  1110046J04Rik  1190002F15Rik  1500015A07Rik
SS2_15_0048_A3            0.0            0.0            0.0            0.0
SS2_15_0048_A6            0.0            0.0            0.0            0.0
SS2_15_0048_A5            0.0            0.0            0.0            0.0
SS2_15_0048_A4            0.0            0.0            0.0            0.0

This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)
In [5]:
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X_train)
plt.figure(figsize = (20,15))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c = y_train, s = 50)
plt.title('PCA: CAFs scRNAseq', fontsize = 20)
plt.xlabel("PC1", fontsize = 20); plt.ylabel("PC2", fontsize = 20)
plt.show()
In [29]:
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 700, 
             n_iter = 1000, verbose = 2, early_exaggeration = 12, method = 'exact', init = X_reduced)
tsne = model.fit_transform(X_train)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y_train, s = 50)
plt.title('tSNE: CAFs scRNAseq', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 716 / 716
[t-SNE] Mean sigma: 17.383014
[t-SNE] Iteration 50: error = 35.2594246, gradient norm = 0.3417314 (50 iterations in 0.643s)
[t-SNE] Iteration 100: error = 33.3934406, gradient norm = 0.2181094 (50 iterations in 0.683s)
[t-SNE] Iteration 150: error = 33.7467567, gradient norm = 0.1884981 (50 iterations in 0.682s)
[t-SNE] Iteration 200: error = 33.8809651, gradient norm = 0.1636423 (50 iterations in 0.676s)
[t-SNE] Iteration 250: error = 34.4723053, gradient norm = 0.1793141 (50 iterations in 0.670s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 34.472305
[t-SNE] Iteration 300: error = 0.1392892, gradient norm = 0.0030119 (50 iterations in 0.671s)
[t-SNE] Iteration 350: error = 0.1148490, gradient norm = 0.0001874 (50 iterations in 0.672s)
[t-SNE] Iteration 400: error = 0.1148120, gradient norm = 0.0000305 (50 iterations in 0.680s)
[t-SNE] Iteration 450: error = 0.1148088, gradient norm = 0.0000016 (50 iterations in 0.678s)
[t-SNE] Iteration 500: error = 0.1148088, gradient norm = 0.0000001 (50 iterations in 0.679s)
[t-SNE] Iteration 500: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 500 iterations: 0.114809

Simulate Data with Clusters for tSNE

In [2]:
import numpy as np
import matplotlib.pyplot as plt

n = 1000
p = 2

X1 = np.random.randn(n, p)
X2 = X1 + np.max(X1)*1.2
X3 = X1 + np.max(X1)*4
X = np.vstack([X1,X2,X3])
y = np.array([['blue']*n, ['red']*n, ['green']*n])
y = y.flatten()

plt.figure(figsize = (20,15))
plt.scatter(X[:, 0], X[:, 1], c = y, s = 50)
plt.title('Original Data', fontsize = 20)
plt.xlabel("Dimension 1", fontsize = 20); plt.ylabel("Dimension 2", fontsize = 20)
plt.show()
In [8]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

pca = PCA(n_components = 2).fit_transform(X)

plt.figure(figsize = (20,15))
for index, perp in enumerate([10, 30, 100, 1000]):
    print('Working with Perplexity = {}'.format(perp))
    model = TSNE(learning_rate = 200, n_components = 2, perplexity = perp, 
                 n_iter = 1000, verbose = 0, init = pca)
    tsne = model.fit_transform(X)
    
    plt.subplot(221 + index)
    plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
    plt.title('tSNE: Perplexity = {}'.format(perp), fontsize = 25)
    plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)

plt.show()
Working with Perplexity = 10
Working with Perplexity = 30
Working with Perplexity = 100
Working with Perplexity = 1000
In [ ]:
 
In [3]:
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X)
In [4]:
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 200, n_components = 2, perplexity = 10, n_iter = 1000, verbose = 2, init = X_reduced)
tsne = model.fit_transform(X)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE: Perplexity = 10', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 3000 samples in 0.003s...
[t-SNE] Computed neighbors for 3000 samples in 0.023s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
[t-SNE] Computed conditional probabilities for sample 2000 / 3000
[t-SNE] Computed conditional probabilities for sample 3000 / 3000
[t-SNE] Mean sigma: 0.117393
[t-SNE] Computed conditional probabilities in 0.072s
[t-SNE] Iteration 50: error = 66.4602966, gradient norm = 0.0153724 (50 iterations in 1.042s)
[t-SNE] Iteration 100: error = 65.0424652, gradient norm = 0.0080951 (50 iterations in 0.948s)
[t-SNE] Iteration 150: error = 64.4218597, gradient norm = 0.0052292 (50 iterations in 0.956s)
[t-SNE] Iteration 200: error = 64.0749893, gradient norm = 0.0049253 (50 iterations in 0.912s)
[t-SNE] Iteration 250: error = 63.8488998, gradient norm = 0.0050150 (50 iterations in 1.049s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 63.848900
[t-SNE] Iteration 300: error = 1.5461835, gradient norm = 0.0012460 (50 iterations in 0.900s)
[t-SNE] Iteration 350: error = 1.0846360, gradient norm = 0.0005697 (50 iterations in 0.657s)
[t-SNE] Iteration 400: error = 0.8999841, gradient norm = 0.0003479 (50 iterations in 0.880s)
[t-SNE] Iteration 450: error = 0.8063845, gradient norm = 0.0002500 (50 iterations in 0.616s)
[t-SNE] Iteration 500: error = 0.7508674, gradient norm = 0.0002004 (50 iterations in 0.843s)
[t-SNE] Iteration 550: error = 0.7152075, gradient norm = 0.0001594 (50 iterations in 0.635s)
[t-SNE] Iteration 600: error = 0.6905052, gradient norm = 0.0001409 (50 iterations in 0.630s)
[t-SNE] Iteration 650: error = 0.6721554, gradient norm = 0.0001298 (50 iterations in 0.682s)
[t-SNE] Iteration 700: error = 0.6581650, gradient norm = 0.0001202 (50 iterations in 0.622s)
[t-SNE] Iteration 750: error = 0.6470391, gradient norm = 0.0001077 (50 iterations in 0.704s)
[t-SNE] Iteration 800: error = 0.6377627, gradient norm = 0.0000977 (50 iterations in 0.686s)
[t-SNE] Iteration 850: error = 0.6298910, gradient norm = 0.0000994 (50 iterations in 0.648s)
[t-SNE] Iteration 900: error = 0.6234340, gradient norm = 0.0000950 (50 iterations in 0.709s)
[t-SNE] Iteration 950: error = 0.6180505, gradient norm = 0.0000986 (50 iterations in 0.666s)
[t-SNE] Iteration 1000: error = 0.6138580, gradient norm = 0.0000855 (50 iterations in 0.663s)
[t-SNE] KL divergence after 1000 iterations: 0.613858
In [5]:
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 200, n_components = 2, perplexity = 30, n_iter = 1000, verbose = 2, init = X_reduced)
tsne = model.fit_transform(X)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE: Perplexity = 30', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 3000 samples in 0.001s...
[t-SNE] Computed neighbors for 3000 samples in 0.060s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
[t-SNE] Computed conditional probabilities for sample 2000 / 3000
[t-SNE] Computed conditional probabilities for sample 3000 / 3000
[t-SNE] Mean sigma: 0.209391
[t-SNE] Computed conditional probabilities in 0.167s
[t-SNE] Iteration 50: error = 60.0897217, gradient norm = 0.0025214 (50 iterations in 1.305s)
[t-SNE] Iteration 100: error = 59.9924812, gradient norm = 0.0027907 (50 iterations in 1.314s)
[t-SNE] Iteration 150: error = 59.9354324, gradient norm = 0.0014755 (50 iterations in 1.345s)
[t-SNE] Iteration 200: error = 59.8856163, gradient norm = 0.0018201 (50 iterations in 1.375s)
[t-SNE] Iteration 250: error = 59.8438492, gradient norm = 0.0023785 (50 iterations in 1.287s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 59.843849
[t-SNE] Iteration 300: error = 1.0661454, gradient norm = 0.0010017 (50 iterations in 0.922s)
[t-SNE] Iteration 350: error = 0.8057499, gradient norm = 0.0004035 (50 iterations in 0.765s)
[t-SNE] Iteration 400: error = 0.6976930, gradient norm = 0.0002331 (50 iterations in 0.792s)
[t-SNE] Iteration 450: error = 0.6432291, gradient norm = 0.0001673 (50 iterations in 0.837s)
[t-SNE] Iteration 500: error = 0.6128566, gradient norm = 0.0001222 (50 iterations in 0.832s)
[t-SNE] Iteration 550: error = 0.5944798, gradient norm = 0.0000988 (50 iterations in 1.136s)
[t-SNE] Iteration 600: error = 0.5816314, gradient norm = 0.0000855 (50 iterations in 0.806s)
[t-SNE] Iteration 650: error = 0.5728312, gradient norm = 0.0000720 (50 iterations in 0.821s)
[t-SNE] Iteration 700: error = 0.5664450, gradient norm = 0.0000656 (50 iterations in 0.814s)
[t-SNE] Iteration 750: error = 0.5612496, gradient norm = 0.0000593 (50 iterations in 0.979s)
[t-SNE] Iteration 800: error = 0.5570272, gradient norm = 0.0000538 (50 iterations in 0.849s)
[t-SNE] Iteration 850: error = 0.5537688, gradient norm = 0.0000512 (50 iterations in 0.849s)
[t-SNE] Iteration 900: error = 0.5509560, gradient norm = 0.0000467 (50 iterations in 0.836s)
[t-SNE] Iteration 950: error = 0.5483001, gradient norm = 0.0000456 (50 iterations in 0.864s)
[t-SNE] Iteration 1000: error = 0.5460615, gradient norm = 0.0000423 (50 iterations in 0.825s)
[t-SNE] KL divergence after 1000 iterations: 0.546061
In [6]:
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 200, n_components = 2, perplexity = 100, n_iter = 1000, verbose = 2, init = X_reduced)
tsne = model.fit_transform(X)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE: Perplexity = 100', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing 301 nearest neighbors...
[t-SNE] Indexed 3000 samples in 0.001s...
[t-SNE] Computed neighbors for 3000 samples in 0.178s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
[t-SNE] Computed conditional probabilities for sample 2000 / 3000
[t-SNE] Computed conditional probabilities for sample 3000 / 3000
[t-SNE] Mean sigma: 0.391522
[t-SNE] Computed conditional probabilities in 0.591s
[t-SNE] Iteration 50: error = 53.7492447, gradient norm = 0.0007004 (50 iterations in 2.588s)
[t-SNE] Iteration 100: error = 53.7483215, gradient norm = 0.0002384 (50 iterations in 2.218s)
[t-SNE] Iteration 150: error = 53.7474632, gradient norm = 0.0002408 (50 iterations in 2.145s)
[t-SNE] Iteration 200: error = 53.7468491, gradient norm = 0.0003052 (50 iterations in 2.254s)
[t-SNE] Iteration 250: error = 53.7462502, gradient norm = 0.0007010 (50 iterations in 2.097s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 53.746250
[t-SNE] Iteration 300: error = 0.6265198, gradient norm = 0.0006932 (50 iterations in 1.413s)
[t-SNE] Iteration 350: error = 0.5051948, gradient norm = 0.0002923 (50 iterations in 1.261s)
[t-SNE] Iteration 400: error = 0.4500135, gradient norm = 0.0001624 (50 iterations in 1.327s)
[t-SNE] Iteration 450: error = 0.4216782, gradient norm = 0.0001060 (50 iterations in 1.371s)
[t-SNE] Iteration 500: error = 0.4053376, gradient norm = 0.0000765 (50 iterations in 1.408s)
[t-SNE] Iteration 550: error = 0.3947235, gradient norm = 0.0000623 (50 iterations in 1.495s)
[t-SNE] Iteration 600: error = 0.3874213, gradient norm = 0.0000505 (50 iterations in 1.459s)
[t-SNE] Iteration 650: error = 0.3823698, gradient norm = 0.0000428 (50 iterations in 1.498s)
[t-SNE] Iteration 700: error = 0.3788204, gradient norm = 0.0000373 (50 iterations in 1.505s)
[t-SNE] Iteration 750: error = 0.3762039, gradient norm = 0.0000337 (50 iterations in 1.470s)
[t-SNE] Iteration 800: error = 0.3741367, gradient norm = 0.0000293 (50 iterations in 1.507s)
[t-SNE] Iteration 850: error = 0.3725063, gradient norm = 0.0000266 (50 iterations in 1.476s)
[t-SNE] Iteration 900: error = 0.3711408, gradient norm = 0.0000265 (50 iterations in 1.526s)
[t-SNE] Iteration 950: error = 0.3700801, gradient norm = 0.0000248 (50 iterations in 1.510s)
[t-SNE] Iteration 1000: error = 0.3692252, gradient norm = 0.0000233 (50 iterations in 1.583s)
[t-SNE] KL divergence after 1000 iterations: 0.369225
In [7]:
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 200, n_components = 2, perplexity = 1000, n_iter = 1000, verbose = 2, 
             init = X_reduced)
tsne = model.fit_transform(X)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE: Perplexity = 1000', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
[t-SNE] Computing 2999 nearest neighbors...
[t-SNE] Indexed 3000 samples in 0.000s...
[t-SNE] Computed neighbors for 3000 samples in 1.032s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
[t-SNE] Computed conditional probabilities for sample 2000 / 3000
[t-SNE] Computed conditional probabilities for sample 3000 / 3000
[t-SNE] Mean sigma: 2.744627
[t-SNE] Computed conditional probabilities in 4.457s
[t-SNE] Iteration 50: error = 32.8451157, gradient norm = 0.0224931 (50 iterations in 7.206s)
[t-SNE] Iteration 100: error = 33.6330986, gradient norm = 0.0001021 (50 iterations in 5.606s)
[t-SNE] Iteration 150: error = 33.6330681, gradient norm = 0.0000829 (50 iterations in 5.496s)
[t-SNE] Iteration 200: error = 33.6329575, gradient norm = 0.0000742 (50 iterations in 5.469s)
[t-SNE] Iteration 250: error = 33.6328583, gradient norm = 0.0000797 (50 iterations in 5.370s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 33.632858
[t-SNE] Iteration 300: error = 0.0514678, gradient norm = 0.0010783 (50 iterations in 5.006s)
[t-SNE] Iteration 350: error = 0.0400803, gradient norm = 0.0001059 (50 iterations in 5.698s)
[t-SNE] Iteration 400: error = 0.0394289, gradient norm = 0.0000361 (50 iterations in 5.695s)
[t-SNE] Iteration 450: error = 0.0386355, gradient norm = 0.0000327 (50 iterations in 5.627s)
[t-SNE] Iteration 500: error = 0.0377427, gradient norm = 0.0000253 (50 iterations in 5.703s)
[t-SNE] Iteration 550: error = 0.0370255, gradient norm = 0.0000179 (50 iterations in 5.730s)
[t-SNE] Iteration 600: error = 0.0366180, gradient norm = 0.0000179 (50 iterations in 5.903s)
[t-SNE] Iteration 650: error = 0.0362002, gradient norm = 0.0000151 (50 iterations in 5.888s)
[t-SNE] Iteration 700: error = 0.0359721, gradient norm = 0.0000125 (50 iterations in 6.124s)
[t-SNE] Iteration 750: error = 0.0356999, gradient norm = 0.0000108 (50 iterations in 6.611s)
[t-SNE] Iteration 800: error = 0.0355202, gradient norm = 0.0000094 (50 iterations in 6.016s)
[t-SNE] Iteration 850: error = 0.0353795, gradient norm = 0.0000100 (50 iterations in 5.957s)
[t-SNE] Iteration 900: error = 0.0351716, gradient norm = 0.0000104 (50 iterations in 6.002s)
[t-SNE] Iteration 950: error = 0.0350280, gradient norm = 0.0000087 (50 iterations in 5.947s)
[t-SNE] Iteration 1000: error = 0.0349660, gradient norm = 0.0000077 (50 iterations in 5.960s)
[t-SNE] KL divergence after 1000 iterations: 0.034966

Now let us make tSNE plots at different perplexities for the World Map data set:

In [9]:
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader

shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
                                        category='cultural', name=shapename)

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('NorthAmerica.png')
plt.close()
        
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela', 
                                           'Columbia', 'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('SouthAmerica.png')
plt.close()

plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Australia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia', 
                                          'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden', 
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway', 
                                           'Italy', 'Iran', 'Pakistan', 'Afganistan', 'Iraq', 'Bulgaria', 
                                           'Romania', 'Turkmenistan', 'Uzbekistan' 'Austria', 'Ireland', 
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Eurasia.png')
plt.close()


plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records(): 
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Marocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia', 
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic', 
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia', 
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa', 
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 
                                           'Nigeria', 'Cameroon']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
        plt.savefig('Africa.png')
plt.close()


rng = np.random.RandomState(123)
plt.figure(figsize = (20,15))

N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)

N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)

N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)

N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)

N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)

plt.title('Original World Map Data Set', fontsize = 25)
plt.xlabel('Dimension 1', fontsize = 22); plt.ylabel('Dimension 2', fontsize = 22)

X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)

plt.show()
(3023, 2)
(3023,)
In [16]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

pca = PCA(n_components = 2).fit_transform(X)

plt.figure(figsize = (20,15))
plt.subplot(221 + 0)
plt.scatter(X[:, 0], X[:, 1], c = y, s = 50)
plt.title('Original World Map Data Set', fontsize = 25)
plt.xlabel("Dimension 1", fontsize = 20); plt.ylabel("Dimension 2", fontsize = 20)

for index, perp in enumerate([500, 1000, 2000]):
    print('Working with Perplexity = {}'.format(perp))
    model = TSNE(learning_rate = 200, n_components = 2, perplexity = perp, 
                 n_iter = 1000, verbose = 0, init = pca)
    tsne = model.fit_transform(X)
    
    plt.subplot(221 + index + 1)
    plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
    plt.title('tSNE: Perplexity = {}'.format(perp), fontsize = 25)
    plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)

plt.show()
Working with Perplexity = 500
Working with Perplexity = 1000
Working with Perplexity = 2000
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: